From 78b85166f2e05fe9079f7d2f5bf6d55de0d01156 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 29 Jun 2023 15:59:16 -0700 Subject: [PATCH 01/13] start with stable diffusion --- docs/source/en/api/pipelines/overview.mdx | 111 ++-------------- .../pipelines/stable_diffusion/text2img.mdx | 23 ++-- .../alt_diffusion/pipeline_alt_diffusion.py | 122 ++++++++---------- .../pipeline_alt_diffusion_img2img.py | 8 +- .../pipelines/audioldm/pipeline_audioldm.py | 8 +- .../controlnet/pipeline_controlnet.py | 17 +-- .../controlnet/pipeline_controlnet_img2img.py | 17 +-- .../controlnet/pipeline_controlnet_inpaint.py | 17 +-- .../pipeline_cycle_diffusion.py | 8 +- .../pipeline_flax_stable_diffusion.py | 61 ++++----- .../pipeline_stable_diffusion.py | 122 ++++++++---------- ...line_stable_diffusion_attend_and_excite.py | 8 +- .../pipeline_stable_diffusion_diffedit.py | 25 ++-- .../pipeline_stable_diffusion_img2img.py | 8 +- .../pipeline_stable_diffusion_inpaint.py | 8 +- ...ipeline_stable_diffusion_inpaint_legacy.py | 8 +- ...eline_stable_diffusion_instruct_pix2pix.py | 8 +- .../pipeline_stable_diffusion_k_diffusion.py | 8 +- .../pipeline_stable_diffusion_ldm3d.py | 25 ++-- ...pipeline_stable_diffusion_model_editing.py | 8 +- .../pipeline_stable_diffusion_panorama.py | 8 +- .../pipeline_stable_diffusion_paradigms.py | 25 ++-- .../pipeline_stable_diffusion_sag.py | 8 +- .../pipeline_stable_unclip.py | 8 +- .../pipeline_stable_unclip_img2img.py | 8 +- .../pipeline_text_to_video_synth.py | 17 +-- 26 files changed, 270 insertions(+), 424 deletions(-) diff --git a/docs/source/en/api/pipelines/overview.mdx b/docs/source/en/api/pipelines/overview.mdx index 1d61ae6a1314..a038a5753c58 100644 --- a/docs/source/en/api/pipelines/overview.mdx +++ b/docs/source/en/api/pipelines/overview.mdx @@ -12,108 +12,25 @@ specific language governing permissions and limitations under the License. # Pipelines -Pipelines provide a simple way to run state-of-the-art diffusion models in inference. -Most diffusion systems consist of multiple independently-trained models and highly adaptable scheduler -components - all of which are needed to have a functioning end-to-end diffusion system. +Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible, and they can be adapted to use different scheduler or even model components. -As an example, [Stable Diffusion](https://huggingface.co/blog/stable_diffusion) has three independently trained models: -- [Autoencoder](./api/models#vae) -- [Conditional Unet](./api/models#UNet2DConditionModel) -- [CLIP text encoder](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPTextModel) -- a scheduler component, [scheduler](./api/scheduler#pndm), -- a [CLIPImageProcessor](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPImageProcessor), -- as well as a [safety checker](./stable_diffusion#safety_checker). -All of these components are necessary to run stable diffusion in inference even though they were trained -or created independently from each other. +All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. -To that end, we strive to offer all open-sourced, state-of-the-art diffusion system under a unified API. 
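For orientation, here is a rough sketch of what that unified API looks like in practice — loading, running, and re-saving a full pipeline in a few calls. This is illustrative only; it assumes a CUDA device and the `runwayml/stable-diffusion-v1-5` checkpoint referenced elsewhere in this patch.

```py
import torch
from diffusers import DiffusionPipeline

# Load every component listed in the checkpoint's model_index.json
# (UNet, VAE, text encoder, tokenizer, scheduler, ...) with a single call.
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipeline.to("cuda")

# Run the whole system end to end from a text prompt.
image = pipeline("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")

# The same object can be written back to disk as a self-contained pipeline.
pipeline.save_pretrained("./stable-diffusion-v1-5")
```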
-More specifically, we strive to provide pipelines that -- 1. can load the officially published weights and yield 1-to-1 the same outputs as the original implementation according to the corresponding paper (*e.g.* [LDMTextToImagePipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/latent_diffusion), uses the officially released weights of [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)), -- 2. have a simple user interface to run the model in inference (see the [Pipelines API](#pipelines-api) section), -- 3. are easy to understand with code that is self-explanatory and can be read along-side the official paper (see [Pipelines summary](#pipelines-summary)), -- 4. can easily be contributed by the community (see the [Contribution](#contribution) section). + -**Note** that pipelines do not (and should not) offer any training functionality. -If you are looking for *official* training examples, please have a look at [examples](https://github.com/huggingface/diffusers/tree/main/examples). +Pipelines do not offer any training functionality. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead! -## 🧨 Diffusers Summary + -The following table summarizes all officially supported pipelines, their corresponding paper, and if -available a colab notebook to directly try them out. +## DiffusionPipeline +[[autodoc]] DiffusionPipeline + - all + - __call__ + - device + - to + - components -| Pipeline | Paper | Tasks | Colab -|---|---|:---:|:---:| -| [alt_diffusion](./alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | - -| [audio_diffusion](./audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio_diffusion.git) | Unconditional Audio Generation | -| [controlnet](./api/pipelines/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb) -| [cycle_diffusion](./cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | -| [dance_diffusion](./dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | -| [ddpm](./ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | -| [ddim](./ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | -| [if](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb) -| [if_img2img](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb) -| [if_inpainting](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb) -| [kandinsky](./kandinsky) | **Kandinsky** | Text-to-Image Generation | -| [kandinsky_inpaint](./kandinsky) | **Kandinsky** | Image-to-Image Generation | -| [kandinsky_img2img](./kandinsky) | **Kandinsksy** | Image-to-Image Generation | -| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | -| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | -| [latent_diffusion_uncond](./latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | -| [paint_by_example](./paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | -| [paradigms](./paradigms) | [**Parallel Sampling of Diffusion Models**](https://arxiv.org/abs/2305.16317) | Text-to-Image Generation | -| [pndm](./pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | -| [score_sde_ve](./score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [score_sde_vp](./score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [semantic_stable_diffusion](./semantic_stable_diffusion) | [**SEGA: Instructing Diffusion using Semantic Dimensions**](https://arxiv.org/abs/2301.12247) | Text-to-Image Generation | -| [stable_diffusion_adapter](./stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation with Adapters | - -| [stable_diffusion_text2img](./stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [stable_diffusion_img2img](./stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) -| [stable_diffusion_inpaint](./stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) -| [stable_diffusion_panorama](./stable_diffusion/panorama) | [**MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation**](https://arxiv.org/abs/2302.08113) | Text-Guided Panorama View Generation | -| 
[stable_diffusion_pix2pix](./stable_diffusion/pix2pix) | [**InstructPix2Pix: Learning to Follow Image Editing Instructions**](https://arxiv.org/abs/2211.09800) | Text-Based Image Editing | -| [stable_diffusion_pix2pix_zero](./stable_diffusion/pix2pix_zero) | [**Zero-shot Image-to-Image Translation**](https://arxiv.org/abs/2302.03027) | Text-Based Image Editing | -| [stable_diffusion_attend_and_excite](./stable_diffusion/attend_and_excite) | [**Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models**](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | -| [stable_diffusion_self_attention_guidance](./stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation | -| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | -| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | -| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation | -| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_safe](./stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) -| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation | -| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation | -| [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | -| [text_to_video_sd](./api/pipelines/text_to_video) | [**Modelscope's Text-to-video-synthesis Model in Open Domain**](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | -| [unclip](./unclip) | [**Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation | -| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | -| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Image Variations Generation | -| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | -| [vq_diffusion](./vq_diffusion) | [**Vector Quantized 
Diffusion Model for Text-to-Image Synthesis**](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | -| [text_to_video_zero](./text_to_video_zero) | [**Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators**](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation | +## FlaxDiffusionPipeline - -**Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers. - -However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the [Examples](#examples) below. - -## Pipelines API - -Diffusion models often consist of multiple independently-trained models or other previously existing components. - - -Each model has been trained independently on a different task and the scheduler can easily be swapped out and replaced with a different one. -During inference, we however want to be able to easily load all components and use them in inference - even if one component, *e.g.* CLIP's text encoder, originates from a different library, such as [Transformers](https://github.com/huggingface/transformers). To that end, all pipelines provide the following functionality: - -- [`from_pretrained` method](../diffusion_pipeline) that accepts a Hugging Face Hub repository id, *e.g.* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) or a path to a local directory, *e.g.* -"./stable-diffusion". To correctly retrieve which models and components should be loaded, one has to provide a `model_index.json` file, *e.g.* [runwayml/stable-diffusion-v1-5/model_index.json](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), which defines all components that should be -loaded into the pipelines. More specifically, for each model/component one needs to define the format `: ["", ""]`. `` is the attribute name given to the loaded instance of `` which can be found in the library or pipeline folder called `""`. -- [`save_pretrained`](../diffusion_pipeline) that accepts a local path, *e.g.* `./stable-diffusion` under which all models/components of the pipeline will be saved. For each component/model a folder is created inside the local path that is named after the given attribute name, *e.g.* `./stable_diffusion/unet`. -In addition, a `model_index.json` file is created at the root of the local path, *e.g.* `./stable_diffusion/model_index.json` so that the complete pipeline can again be instantiated -from the local path. -- [`to`](../diffusion_pipeline) which accepts a `string` or `torch.device` to move all models that are of type `torch.nn.Module` to the passed device. The behavior is fully analogous to [PyTorch's `to` method](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to). -- [`__call__`] method to use the pipeline in inference. `__call__` defines inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing to forwarding tensors to the different models and schedulers, as well as post-processing. The API of the `__call__` method can strongly vary from pipeline to pipeline. *E.g.* a text-to-image pipeline, such as [`StableDiffusionPipeline`](./stable_diffusion) should accept among other things the text prompt to generate the image. 
A pure image generation pipeline, such as [DDPMPipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/ddpm) on the other hand can be run without providing any inputs. To better understand what inputs can be adapted for -each pipeline, one should look directly into the respective pipeline. - -**Note**: All pipelines have PyTorch's autograd disabled by decorating the `__call__` method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should -not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community). +[[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 0e3f51117555..02201f9c3a28 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -1,4 +1,4 @@ - -# Text-to-Image Generation +# Text-to-Image -## StableDiffusionPipeline +The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. -The Stable Diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photo-realistic images given any text input using Stable Diffusion. 
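As a minimal sketch of the text-to-image workflow described above (assuming a CUDA device and the `stabilityai/stable-diffusion-2-1` checkpoint listed in this document), the snippet below also swaps the default scheduler for a different one to show that pipeline components are interchangeable:

```py
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
# The scheduler is a swappable component; replace the default with another one
# configured from the same checkpoint settings.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=25,
).images[0]
image.save("astronaut.png")
```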
+| Stable Diffusion version | Repository | +|--------------------------|---------------------------------------------------------------------------------| +| v1 | [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) | +| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) | -The original codebase can be found here: -- *Stable Diffusion V1*: [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) -- *Stable Diffusion v2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) - -Available Checkpoints are: -- *stable-diffusion-v1-4 (512x512 resolution)* [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) -- *stable-diffusion-v1-5 (512x512 resolution)* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) -- *stable-diffusion-2-base (512x512 resolution)*: [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) -- *stable-diffusion-2 (768x768 resolution)*: [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) -- *stable-diffusion-2-1-base (512x512 resolution)* [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) -- *stable-diffusion-2-1 (768x768 resolution)*: [stabilityai/stable-diffusion-2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) +Additional official checkpoints for different versions of the Stable Diffusion model can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. [[autodoc]] StableDiffusionPipeline - all diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py index 507c082d9363..c0c3e9cd76e3 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py @@ -71,36 +71,32 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL r""" Pipeline for text-to-image generation using Alt Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] - Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`RobertaSeriesModelWithTransformation`]): - Frozen text-encoder. 
Alt Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.RobertaSeriesModelWithTransformation), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`XLMRobertaTokenizer`): - Tokenizer of class - [XLMRobertaTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.XLMRobertaTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.XLMRobertaTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -196,42 +192,39 @@ def __init__( def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. 
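The slicing and tiling switches described above are plain methods on the pipeline object. A minimal sketch of sliced decoding wrapped around a large batch follows; it assumes the `runwayml/stable-diffusion-v1-5` checkpoint and a CUDA device, and the same copied methods are available on `AltDiffusionPipeline`:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Decode the batch of latents slice-by-slice instead of all at once,
# trading a little speed for a smaller peak memory footprint.
pipe.enable_vae_slicing()
images = pipe(["a watercolor painting of a lighthouse"] * 8).images

# Restore single-step decoding afterwards.
pipe.disable_vae_slicing()
```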
""" self.vae.disable_tiling() def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -542,78 +535,69 @@ def __call__( guidance_rescale: float = 0.0, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds` height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. 
+ Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). 
+ A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). guidance_rescale (`float`, *optional*, defaults to 0.7): - Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are - Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of - [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). - Guidance rescale factor should fix overexposure when using zero terminal SNR. + Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are + Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when + using zero terminal SNR. Examples: Returns: [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py index d6f297122ba4..47603919035c 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py @@ -226,10 +226,10 @@ def __init__( def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. 
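For reference, a minimal sketch of how this offloading mode is typically enabled (assuming `accelerate>=0.17.0` is installed, as the check below requires, and using the `runwayml/stable-diffusion-v1-5` checkpoint for illustration). Note that `enable_model_cpu_offload` takes the place of an explicit `pipe.to("cuda")`:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Instead of moving everything to the GPU up front, let accelerate move each
# sub-model to the GPU when its forward pass runs; it stays there until the
# next model needs the device.
pipe.enable_model_cpu_offload()

image = pipe("an oil painting of a snowy mountain village").images[0]
```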
""" if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py index 6da8e809103e..fe204afa7436 100644 --- a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py +++ b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py @@ -93,17 +93,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet.py index d5646f3c43c1..42d2d14c44c5 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet.py @@ -181,17 +181,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -199,17 +197,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. 
This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py index 9f989769a345..a161239b6767 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py @@ -207,17 +207,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -225,17 +223,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. 
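Tiled decoding is most useful when the output resolution is well above the training size. A short sketch under the same assumptions as before (CUDA device, `runwayml/stable-diffusion-v1-5`); the identical copied methods apply to the ControlNet pipelines in this file:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Tiled decoding lets the VAE process one tile at a time, keeping memory
# bounded when decoding latents for images much larger than 512x512.
pipe.enable_vae_tiling()
image = pipe(
    "a wide panoramic view of snowy mountains", height=1024, width=2048
).images[0]

pipe.disable_vae_tiling()
```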
""" self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py index b3990fd0638a..7d9a8cd387fb 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py @@ -324,17 +324,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -342,17 +340,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py index 9a68c4d059c6..316a83125468 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py @@ -234,10 +234,10 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. 
Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py index 3b4f77029ce4..40919cf40cb0 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py @@ -82,29 +82,28 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline): r""" Pipeline for text-to-image generation using Stable Diffusion. - This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`FlaxAutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`FlaxCLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.FlaxCLIPTextModel), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`FlaxUNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`FlaxUNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or [`FlaxDPMSolverMultistepScheduler`]. safety_checker ([`FlaxStableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. 
+ A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -324,31 +323,35 @@ def __call__( jit: bool = False, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: - prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. + prompt (`str` or `List[str]`, *optional*): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds` height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. latents (`jnp.array`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image - generation. Can be used to tweak the same generation with different prompts. tensor will ge generated - by sampling using the supplied random `generator`. + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image + generation. Can be used to tweak the same generation with different prompts. If not provided, a latents + tensor is generated by sampling using the supplied random `generator`. jit (`bool`, defaults to `False`): - Whether to run `pmap` versions of the generation and safety scoring functions. NOTE: This argument - exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a future release. + Whether to run `pmap` versions of the generation and safety scoring functions. + + + + This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a + future release. + + + return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead of a plain tuple. @@ -357,10 +360,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a - `tuple. When returning a tuple, the first element is a list with the generated images, and the second - element is a list of `bool`s denoting whether the corresponding generated image likely represents - "not-safe-for-work" (nsfw) content, according to the `safety_checker`. 
+ If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is a list with the generated images + and the second element is a list of `bool`s indicating whether the corresponding generated image + contains "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py index 54927049571c..1c31670a80c8 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py @@ -73,36 +73,32 @@ class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lo r""" Pipeline for text-to-image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] - Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. 
feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -198,42 +194,39 @@ def __init__( def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -542,78 +535,69 @@ def __call__( guidance_rescale: float = 0.0, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. 
If not defined, you need to pass `prompt_embeds` height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. 
             negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
+                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
+                not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
                 Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
                 plain tuple.
             callback (`Callable`, *optional*):
-                A function that will be called every `callback_steps` steps during inference. The function will be
-                called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+                A function that is called every `callback_steps` steps during inference. The function is called with
+                the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
             callback_steps (`int`, *optional*, defaults to 1):
-                The frequency at which the `callback` function will be called. If not specified, the callback will be
-                called at every step.
+                The frequency at which the `callback` function is called. If not specified, the callback is called at
+                every step.
             cross_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
+                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
+                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
             guidance_rescale (`float`, *optional*, defaults to 0.7):
-                Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
-                Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of
-                [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
-                Guidance rescale factor should fix overexposure when using zero terminal SNR.
+                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
+                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
+                using zero terminal SNR.
 
         Examples:
 
         Returns:
             [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
-            [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple.
-            When returning a tuple, the first element is a list with the generated images, and the second element is a
-            list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
-            (nsfw) content, according to the `safety_checker`.
+ If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py index 15a5d7eb1362..7541cbef88ab 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py @@ -236,17 +236,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py index b2d5953808f1..07fd7ca61ad0 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py @@ -370,17 +370,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. 
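To make the `StableDiffusionPipeline.__call__` arguments documented in the hunk above concrete, here is a minimal text-to-image sketch. The checkpoint name and parameter values are illustrative assumptions, not recommendations from this patch.

```py
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion text-to-image checkpoint should work here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A seeded generator makes the run reproducible.
generator = torch.Generator("cuda").manual_seed(0)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    negative_prompt="low quality, blurry",  # ignored when guidance_scale < 1
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("astronaut.png")
```

With a checkpoint trained for zero terminal SNR, `guidance_rescale` can be passed in the same call, as the docstring above describes.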
""" self.vae.disable_slicing() @@ -388,17 +386,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() @@ -406,10 +403,10 @@ def disable_vae_tiling(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py index ab4420f28386..e4c5928c08a4 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py @@ -230,10 +230,10 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. 
Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py index 0750c40b66fb..11cd7851c754 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py @@ -298,10 +298,10 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py index 049e3d18f3de..c8dbb7321043 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py @@ -223,10 +223,10 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. 
""" if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py index 341ff8daad42..9b6015a6cb24 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py @@ -431,10 +431,10 @@ def __call__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_k_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_k_diffusion.py index 29a57470a341..9e7a338c8bb3 100755 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_k_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_k_diffusion.py @@ -130,10 +130,10 @@ def set_scheduler(self, scheduler_type: str): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. 
""" if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py index 95dd207f9d12..280779cd8539 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py @@ -160,17 +160,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -178,17 +176,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() @@ -196,10 +193,10 @@ def disable_vae_tiling(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. 
Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py index 2ecb3f9dbaf7..113d11c6afcb 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py @@ -167,17 +167,15 @@ def append_ca(net_): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py index 37e705d1bc5a..a0cd1444f394 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py @@ -129,17 +129,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. 
""" self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py index 073f02e8ee98..ae7cdc4cda42 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py @@ -149,17 +149,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -167,17 +165,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() @@ -185,10 +182,10 @@ def disable_vae_tiling(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. 
Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py index 9c583de9ca9c..9d5bc8bdd8df 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py @@ -148,17 +148,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py index 7c89bfedbd59..73dddafa9172 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py @@ -145,17 +145,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. 
""" self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py index 003c82ff4f8a..cdf4254301a1 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py @@ -147,17 +147,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py index dad7d5639892..b6600803747e 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py @@ -116,17 +116,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -134,17 +132,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. 
When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() From 3a810e520efb5dd05fb18ec038d413981cc347d8 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 29 Jun 2023 16:23:48 -0700 Subject: [PATCH 02/13] fix --- docs/source/en/api/pipelines/stable_diffusion/text2img.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 02201f9c3a28..71d0094be588 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -1,4 +1,4 @@ -* -# Depth-to-Image Generation +# Depth-to-Image -## StableDiffusionDepth2ImgPipeline - -The depth-guided stable diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/), as part of Stable Diffusion 2.0. It uses [MiDas](https://github.com/isl-org/MiDaS) to infer depth based on an image. - -[`StableDiffusionDepth2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the images’ structure. +The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. -The original codebase can be found here: -- *Stable Diffusion v2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) +The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) and additional official checkpoints for depth-to-image can be found [here](https://huggingface.co/stabilityai/stable-diffusion-2-depth). -Available Checkpoints are: -- *stable-diffusion-2-depth*: [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) +## StableDiffusionDepth2ImgPipeline [[autodoc]] StableDiffusionDepth2ImgPipeline - all @@ -34,3 +28,5 @@ Available Checkpoints are: - load_textual_inversion - load_lora_weights - save_lora_weights + +## StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx index 8ca69ff69aec..831d16f1317f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx @@ -12,15 +12,11 @@ specific language governing permissions and limitations under the License. 
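Before moving on to image variation, here is a hedged sketch of the depth-to-image usage described above. The checkpoint follows the link in that section; the input image URL is just a commonly used sample and can be replaced with your own picture.

```py
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

# MiDaS estimates a depth map automatically when no explicit `depth_map` is passed.
init_image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")

image = pipe(
    prompt="two tigers, photorealistic, high detail",
    image=init_image,
    negative_prompt="bad, deformed, blurry",
    strength=0.7,
).images[0]
```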
# Image Variation -## StableDiffusionImageVariationPipeline - -[`StableDiffusionImageVariationPipeline`] lets you generate variations from an input image using Stable Diffusion. It uses a fine-tuned version of Stable Diffusion model, trained by [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/). +The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model from [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/). -The original codebase can be found here: -[Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) +The original codebase can be found at [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). -Available Checkpoints are: -- *sd-image-variations-diffusers*: [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers) +## StableDiffusionImageVariationPipeline [[autodoc]] StableDiffusionImageVariationPipeline - all @@ -29,3 +25,7 @@ Available Checkpoints are: - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention + +## StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx index c70f9ac9dcb7..d99c4535ba29 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx @@ -10,18 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Image-to-Image Generation +# Image-to-Image -## StableDiffusionImg2ImgPipeline +The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. The original codebase can be found at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion/blob/main/scripts/img2img.py). -The Stable Diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images using Stable Diffusion. +The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon). 
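A minimal sketch of this image-to-image flow, assuming the `runwayml/stable-diffusion-v1-5` checkpoint and a local input file; `strength` controls how much noise SDEdit adds to the input before denoising it toward the prompt.

```py
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace with your own starting image (a rough sketch or photo works well).
init_image = load_image("sketch-mountains-input.jpg").resize((768, 512))

image = pipe(
    prompt="A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
```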
-The original codebase can be found here: [CampVis/stable-diffusion](https://github.com/CompVis/stable-diffusion/blob/main/scripts/img2img.py) +The abstract from the paper is: -[`StableDiffusionImg2ImgPipeline`] is compatible with all Stable Diffusion checkpoints for [Text-to-Image](./text2img) +*Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.* -The pipeline uses the diffusion-denoising mechanism proposed by SDEdit ([SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://arxiv.org/abs/2108.01073) -proposed by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon). +## StableDiffusionImg2ImgPipeline [[autodoc]] StableDiffusionImg2ImgPipeline - all @@ -35,6 +34,16 @@ proposed by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan - load_lora_weights - save_lora_weights +## StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput + +## FlaxStableDiffusionImg2ImgPipeline + [[autodoc]] FlaxStableDiffusionImg2ImgPipeline - all - __call__ + +## FlaxStableDiffusionPipelineOutput + +[[autodoc]] FlaxStableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx index 39e5ae0fd37d..44e2e5c464ac 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx @@ -10,19 +10,28 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Text-Guided Image Inpainting +# Inpaint -## StableDiffusionInpaintPipeline +The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. 
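To make the masked-editing workflow concrete before the checkpoint table that follows, here is a rough sketch. The inpainting checkpoint is one of the fine-tuned ones recommended further down; the image and mask paths are placeholders, with white mask pixels marking the region to repaint.

```py
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Replace with your own 512x512 image and a black/white mask of the same size.
init_image = load_image("overture-creations.png")
mask_image = load_image("overture-creations-mask.png")

image = pipe(
    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
```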
+ +| Stable Diffusion version | Repository | +|--------------------------|------------------------------------------------------------------------------------------------------------------------| +| v1 | [CompVis/stable-diffusion](https://github.com/runwayml/stable-diffusion#inpainting-with-stable-diffusion) | +| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-inpainting-with-stable-diffusion) | + +Additional official checkpoints for different versions of the Stable Diffusion model for inpainting can be found on the [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. -The Stable Diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionInpaintPipeline`] lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. + -The original codebase can be found here: -- *Stable Diffusion V1*: [CampVis/stable-diffusion](https://github.com/runwayml/stable-diffusion#inpainting-with-stable-diffusion) -- *Stable Diffusion V2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-inpainting-with-stable-diffusion) +It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such +as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default +text-to-image Stable Diffusion checkpoints, such as +[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible with +this pipeline but might be less performant. -Available checkpoints are: -- *stable-diffusion-inpainting (512x512 resolution)*: [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting) -- *stable-diffusion-2-inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) + + +## StableDiffusionInpaintPipeline [[autodoc]] StableDiffusionInpaintPipeline - all @@ -35,6 +44,16 @@ Available checkpoints are: - load_lora_weights - save_lora_weights +## StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput + +## FlaxStableDiffusionInpaintPipeline + [[autodoc]] FlaxStableDiffusionInpaintPipeline - all - __call__ + +## FlaxStableDiffusionPipelineOutput + +[[autodoc]] FlaxStableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx index 61fd2f799114..55aa603f9e6a 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx @@ -10,18 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Stable Diffusion Latent Upscaler +# Latent Upscaler -## StableDiffusionLatentUpscalePipeline - -The Stable Diffusion Latent Upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). 
It can be used on top of any [`StableDiffusionUpscalePipeline`] checkpoint to enhance its output image resolution by a factor of 2. - -A notebook that demonstrates the original implementation can be found here: -- [Stable Diffusion Upscaler Demo](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) +The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2. -Available Checkpoints are: -- *stabilityai/latent-upscaler*: [stabilityai/sd-x2-latent-upscaler](https://huggingface.co/stabilityai/sd-x2-latent-upscaler) +The [Stable Diffusion Upscaler Demo](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) demonstrates the original implementation. +## StableDiffusionLatentUpscalePipeline [[autodoc]] StableDiffusionLatentUpscalePipeline - all @@ -30,4 +25,8 @@ Available Checkpoints are: - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention \ No newline at end of file + - disable_xformers_memory_efficient_attention + +# StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx index d311fdb5f4f6..2da653ffa141 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx @@ -12,23 +12,11 @@ specific language governing permissions and limitations under the License. # LDM3D -LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal -The abstract of the paper is the following: +LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./stable_diffusion/overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. -*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. 
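As a sketch of how the latent upscaler is typically chained after a base text-to-image pipeline (the model ids follow the links above; `output_type="latent"` keeps the first stage's result in latent space so it can be fed straight into the upscaler):

```py
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

generator = torch.Generator("cuda").manual_seed(33)

# Stage 1: generate 512x512 latents with a regular Stable Diffusion checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
prompt = "a photo of an astronaut, high resolution, ultra realistic"
low_res_latents = pipe(prompt, generator=generator, output_type="latent").images

# Stage 2: upscale the latents by a factor of 2, guided by the same prompt.
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")
image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]
image.save("astronaut_1024.png")
```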
Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* - - -*Overview*: - -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [pipeline_stable_diffusion_ldm3d.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py) | *Text-to-Image Generation* | - | - - -## Tips - -- LDM3D generates both an image and a depth map from a given text prompt, compared to the existing txt-to-img diffusion models such as [Stable Diffusion](./stable_diffusion/overview) that generates only an image. -- With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. +The abstract from the paper is: +*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* Running LDM3D is straighforward with the [`StableDiffusionLDM3DPipeline`]: @@ -43,13 +31,14 @@ rgb_image[0].save("lemons_ldm3d_rgb.jpg") depth_image[0].save("lemons_ldm3d_depth.png") ``` +## StableDiffusionLDM3DPipeline -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput +[[autodoc]] StableDiffusionLDM3DPipeline - all - __call__ -## StableDiffusionLDM3DPipeline -[[autodoc]] StableDiffusionLDM3DPipeline +## StableDiffusionPipelineOutput + +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput - all - __call__ diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx index 6ba805cf445d..48d3ec0680d7 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx @@ -10,225 +10,27 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Stable diffusion 2 +# Stable Diffusion 2 -Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of [Stable Diffusion 1](https://stability.ai/blog/stable-diffusion-public-release). -The project to train Stable Diffusion 2 was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). 
+Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/).

*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).*

-For more details about how Stable Diffusion 2 works and how it differs from Stable Diffusion 1, please refer to the official [launch announcement post](https://stability.ai/blog/stable-diffusion-v2-release).
+For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [launch announcement post](https://stability.ai/blog/stable-diffusion-v2-release).

-## Tips
+<Tip>

-### Available checkpoints:
+The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./stable_diffusion/text2img) so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler.

-Note that the architecture is more or less identical to [Stable Diffusion 1](./stable_diffusion/overview) so please refer to [this page](./stable_diffusion/overview) for API documentation.
+</Tip>

-- *Text-to-Image (512x512 resolution)*: [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) with [`StableDiffusionPipeline`]
-- *Text-to-Image (768x768 resolution)*: [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) with [`StableDiffusionPipeline`]
-- *Image Inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) with [`StableDiffusionInpaintPipeline`]
-- *Super-Resolution (x4 resolution resolution)*: [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) [`StableDiffusionUpscalePipeline`]
-- *Depth-to-Image (512x512 resolution)*: [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) with [`StableDiffusionDepth2ImagePipeline`]
+Stable Diffusion 2 is available for tasks like inpainting, super-resolution, and depth-to-image:

-We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler there is.
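A minimal text-to-image sketch with the recommended scheduler, mirroring the example removed further down in this diff (the checkpoint id, prompt, and step count are illustrative, not required):

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# load the 768x768 Stable Diffusion 2 checkpoint and swap in the faster multistep scheduler
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
```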
- - -### Text-to-Image - -- *Text-to-Image (512x512 resolution)*: [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) with [`StableDiffusionPipeline`] - -```python -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler -import torch - -repo_id = "stabilityai/stable-diffusion-2-base" -pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -prompt = "High quality photo of an astronaut riding a horse in space" -image = pipe(prompt, num_inference_steps=25).images[0] -image.save("astronaut.png") -``` - -- *Text-to-Image (768x768 resolution)*: [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) with [`StableDiffusionPipeline`] - -```python -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler -import torch - -repo_id = "stabilityai/stable-diffusion-2" -pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -prompt = "High quality photo of an astronaut riding a horse in space" -image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0] -image.save("astronaut.png") -``` - -#### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed": - -The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)** -claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. - -The abstract reads as follows: - -*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), -and some implementations of diffusion samplers do not start from the last timestep. -Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. -We show that the flawed design causes real problems in existing implementations. -In Stable Diffusion, it severely limits the model to only generate images with medium brightness and -prevents it from generating very bright and dark samples. We propose a few simple fixes: -- (1) rescale the noise schedule to enforce zero terminal SNR; -- (2) train the model with v prediction; -- (3) change the sampler to always start from the last timestep; -- (4) rescale classifier-free guidance to prevent over-exposure. -These simple changes ensure the diffusion process is congruent between training and inference and -allow the model to generate samples more faithful to the original data distribution.* - -You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]: -- (1) rescale the noise schedule to enforce zero terminal SNR; -```py -pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True) -``` -- (2) train the model with v prediction; -Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) -and `--prediction_type="v_prediction"`. 
-- (3) change the sampler to always start from the last timestep; -```py -pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing") -``` -- (4) rescale classifier-free guidance to prevent over-exposure. -```py -pipe(..., guidance_rescale=0.7) -``` - -An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2) -which has been fine-tuned using the `"v_prediction"`. - -The checkpoint can then be run in inference as follows: - -```py -from diffusers import DiffusionPipeline, DDIMScheduler - -pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16) -pipe.scheduler = DDIMScheduler.from_config( - pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing" -) -pipe.to("cuda") - -prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k" -image = pipeline(prompt, guidance_rescale=0.7).images[0] -``` - -## DDIMScheduler -[[autodoc]] DDIMScheduler - -### Image Inpainting - -- *Image Inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) with [`StableDiffusionInpaintPipeline`] - -```python -import PIL -import requests -import torch -from io import BytesIO - -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler - - -def download_image(url): - response = requests.get(url) - return PIL.Image.open(BytesIO(response.content)).convert("RGB") - - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = download_image(img_url).resize((512, 512)) -mask_image = download_image(mask_url).resize((512, 512)) - -repo_id = "stabilityai/stable-diffusion-2-inpainting" -pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -prompt = "Face of a yellow cat, high resolution, sitting on a park bench" -image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0] - -image.save("yellow_cat.png") -``` - -### Super-Resolution - -- *Image Upscaling (x4 resolution resolution)*: [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) with [`StableDiffusionUpscalePipeline`] - - -```python -import requests -from PIL import Image -from io import BytesIO -from diffusers import StableDiffusionUpscalePipeline -import torch - -# load model and scheduler -model_id = "stabilityai/stable-diffusion-x4-upscaler" -pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16) -pipeline = pipeline.to("cuda") - -# let's download an image -url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png" -response = requests.get(url) -low_res_img = Image.open(BytesIO(response.content)).convert("RGB") -low_res_img = low_res_img.resize((128, 128)) -prompt = "a white cat" -upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0] -upscaled_image.save("upsampled_cat.png") -``` - -### Depth-to-Image - -- *Depth-Guided Text-to-Image*: 
[stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) [`StableDiffusionDepth2ImagePipeline`] - - -```python -import torch -import requests -from PIL import Image - -from diffusers import StableDiffusionDepth2ImgPipeline - -pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-depth", - torch_dtype=torch.float16, -).to("cuda") - - -url = "http://images.cocodataset.org/val2017/000000039769.jpg" -init_image = Image.open(requests.get(url, stream=True).raw) -prompt = "two tigers" -n_propmt = "bad, deformed, ugly, bad anotomy" -image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0] -``` - -### How to load and use different schedulers. - -The stable diffusion pipeline uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the stable diffusion pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc. -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following: - -```python ->>> from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler - ->>> pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2") ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) - ->>> # or ->>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-2", subfolder="scheduler") ->>> pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", scheduler=euler_scheduler) -``` +| Task | Repository | +|-------------------------|---------------------------------------------------------------------------------------------------------------| +| text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) | +| text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) | +| inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) | +| super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | +| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx index 035c7155ef93..0800e6115810 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx @@ -12,42 +12,24 @@ specific language governing permissions and limitations under the License. # Safe Stable Diffusion -Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://arxiv.org/abs/2211.05105) and mitigates the well known issue that models like Stable Diffusion that are trained on unfiltered, web-crawled datasets tend to suffer from inappropriate degeneration. 
For instance Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, or otherwise offensive content. -Safe Stable Diffusion is an extension to the Stable Diffusion that drastically reduces content like this. +Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105) and mitigates inappropriate degeneration from Stable Diffusion models because they're trained on unfiltered web-crawled datasets. For instance Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, and otherwise offensive content. Safe Stable Diffusion is an extension of Stable Diffusion that drastically reduces this type of content. -The abstract of the paper is the following: +The abstract from the paper is: *Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.* +Use the `safety_concept` property of [`StableDiffusionPipelineSafe`] to check and edit the current safety concept: -*Overview*: - -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [pipeline_stable_diffusion_safe.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) | [![Huggingface Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion) - -## Tips - -- Safe Stable Diffusion may also be used with weights of [Stable Diffusion](./stable_diffusion/text2img). - -### Run Safe Stable Diffusion - -Safe Stable Diffusion can be tested very easily with the [`StableDiffusionPipelineSafe`], and the `"AIML-TUDA/stable-diffusion-safe"` checkpoint exactly in the same way it is shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation). 
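A minimal sketch of running [`StableDiffusionPipelineSafe`] with the `"AIML-TUDA/stable-diffusion-safe"` checkpoint mentioned above (the prompt and `torch_dtype` below are illustrative):

```py
import torch
from diffusers import StableDiffusionPipelineSafe

# load the Safe Stable Diffusion checkpoint referenced on this page
pipeline = StableDiffusionPipelineSafe.from_pretrained(
    "AIML-TUDA/stable-diffusion-safe", torch_dtype=torch.float16
).to("cuda")

prompt = "portrait photo of an old warrior chief"
image = pipeline(prompt=prompt).images[0]
image.save("warrior_chief.png")
```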
- -### Interacting with the Safety Concept - -To check and edit the currently used safety concept, use the `safety_concept` property of [`StableDiffusionPipelineSafe`]: ```python >>> from diffusers import StableDiffusionPipelineSafe >>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") >>> pipeline.safety_concept +'an image showing hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality, cruelty' ``` For each image generation the active concept is also contained in [`StableDiffusionSafePipelineOutput`]. -### Using pre-defined safety configurations - -You may use the 4 configurations defined in the [Safe Latent Diffusion paper](https://arxiv.org/abs/2211.05105) as follows: +There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyConfig.STRONG`, and `SafetyConfig.MAX`) that can be applied: ```python >>> from diffusers import StableDiffusionPipelineSafe @@ -58,33 +40,14 @@ You may use the 4 configurations defined in the [Safe Latent Diffusion paper](ht >>> out = pipeline(prompt=prompt, **SafetyConfig.MAX) ``` -The following configurations are available: `SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyConfig.STRONG`, and `SafetyConfig.MAX`. - -### How to load and use different schedulers - -The safe stable diffusion pipeline uses [`PNDMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the stable diffusion pipeline such as [`DDIMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc. -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following: - -```python ->>> from diffusers import StableDiffusionPipelineSafe, EulerDiscreteScheduler - ->>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) - ->>> # or ->>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("AIML-TUDA/stable-diffusion-safe", subfolder="scheduler") ->>> pipeline = StableDiffusionPipelineSafe.from_pretrained( -... "AIML-TUDA/stable-diffusion-safe", scheduler=euler_scheduler -... ) -``` - +## StableDiffusionPipelineSafe -## StableDiffusionSafePipelineOutput -[[autodoc]] pipelines.stable_diffusion_safe.StableDiffusionSafePipelineOutput +[[autodoc]] StableDiffusionPipelineSafe - all - __call__ -## StableDiffusionPipelineSafe -[[autodoc]] StableDiffusionPipelineSafe +## StableDiffusionSafePipelineOutput + +[[autodoc]] pipelines.stable_diffusion_safe.StableDiffusionSafePipelineOutput - all - __call__ diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 71d0094be588..05b50d8595ce 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -12,7 +12,12 @@ specific language governing permissions and limitations under the License. 
# Text-to-Image -The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. +The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. + +The abstract from the paper is: + +*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .* + | Stable Diffusion version | Repository | |--------------------------|---------------------------------------------------------------------------------| @@ -21,6 +26,8 @@ The Stable Diffusion model was created by researchers and engineers from [CompVi Additional official checkpoints for different versions of the Stable Diffusion model can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. 
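As a quick orientation before the API reference, a minimal text-to-image sketch (the repository id and prompt are illustrative; the other checkpoints above work the same way):

```py
import torch
from diffusers import StableDiffusionPipeline

# load a Stable Diffusion checkpoint in half precision and move it to the GPU
pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipeline("An astronaut riding a horse on Mars").images[0]
image.save("astronaut.png")
```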
+## StableDiffusionPipeline + [[autodoc]] StableDiffusionPipeline - all - __call__ @@ -37,6 +44,16 @@ Additional official checkpoints for different versions of the Stable Diffusion m - load_lora_weights - save_lora_weights +## StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput + +## FlaxStableDiffusionPipeline + [[autodoc]] FlaxStableDiffusionPipeline - all - __call__ + +## FlaxStableDiffusionPipelineOutput + +[[autodoc]] FlaxStableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx index f70d8f445fd9..394054a9ca8b 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx @@ -10,18 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Super-Resolution +# Super Resolution -## StableDiffusionUpscalePipeline - -The upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/), as part of Stable Diffusion 2.0. [`StableDiffusionUpscalePipeline`] can be used to enhance the resolution of input images by a factor of 4. - -The original codebase can be found here: -- *Stable Diffusion v2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-upscaling-with-stable-diffusion) +The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4. -Available Checkpoints are: -- *stabilityai/stable-diffusion-x4-upscaler (x4 resolution resolution)*: [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) +The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-upscaling-with-stable-diffusion) and additional official checkpoints for super resolution can be found at [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler). +## StableDiffusionUpscalePipeline [[autodoc]] StableDiffusionUpscalePipeline - all @@ -29,4 +24,8 @@ Available Checkpoints are: - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention \ No newline at end of file + - disable_xformers_memory_efficient_attention + +## StableDiffusionPipelineOutput + +[[autodoc]] StableDiffusionPipelineOutput \ No newline at end of file diff --git a/src/diffusers/pipelines/alt_diffusion/__init__.py b/src/diffusers/pipelines/alt_diffusion/__init__.py index dab2d8db1045..987f39e78f50 100644 --- a/src/diffusers/pipelines/alt_diffusion/__init__.py +++ b/src/diffusers/pipelines/alt_diffusion/__init__.py @@ -16,11 +16,11 @@ class AltDiffusionPipelineOutput(BaseOutput): Args: images (`List[PIL.Image.Image]` or `np.ndarray`) - List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width, - num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline. 
+ List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, + num_channels)`. nsfw_content_detected (`List[bool]`) - List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, or `None` if safety checking could not be performed. + List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or + `None` if safety checking could not be performed. """ images: Union[List[PIL.Image.Image], np.ndarray] diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py index c0c3e9cd76e3..480596bda5b3 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py @@ -539,8 +539,8 @@ def __call__( Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds` - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py index 47603919035c..8eb4a68ee99c 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py @@ -99,38 +99,34 @@ class AltDiffusionImg2ImgPipeline( DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin ): r""" - Pipeline for text-guided image to image generation using Alt Diffusion. + Pipeline for text-guided image-to-image generation using Alt Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] - Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`RobertaSeriesModelWithTransformation`]): - Frozen text-encoder. 
Alt Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.RobertaSeriesModelWithTransformation), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`XLMRobertaTokenizer`): - Tokenizer of class - [XLMRobertaTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.XLMRobertaTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.XLMRobertaTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -587,74 +583,66 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch, that will be used as the starting point for the - process. Can also accpet image latents as `image`, if passing latents directly, it will not be encoded - again. + `Image` or tensor representing an image batch to be used as the starting point. Can also accept image + latents as `image`, if passing latents directly, it will not be encoded again. strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in - `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. 
The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. This parameter will be modulated by `strength`. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. 
output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + Examples: Returns: [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs. Raise error if not correct self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) diff --git a/src/diffusers/pipelines/pipeline_flax_utils.py b/src/diffusers/pipelines/pipeline_flax_utils.py index e1c4b9f53953..f5e7880da1cd 100644 --- a/src/diffusers/pipelines/pipeline_flax_utils.py +++ b/src/diffusers/pipelines/pipeline_flax_utils.py @@ -92,18 +92,17 @@ class FlaxImagePipelineOutput(BaseOutput): class FlaxDiffusionPipeline(ConfigMixin): r""" - Base class for all models. + Base class for Flax-based pipelines. - [`FlaxDiffusionPipeline`] takes care of storing all components (models, schedulers, processors) for diffusion - pipelines and handles methods for loading, downloading and saving models as well as a few methods common to all - pipelines to: + [`FlaxDiffusionPipeline`] stores all components (models, schedulers, and processors) for diffusion pipelines and + provides methods for loading, downloading and saving models. 
It also includes methods to: - enabling/disabling the progress bar for the denoising iteration Class attributes: - - **config_name** ([`str`]) -- name of the config file that will store the class and module names of all - components of the diffusion pipeline. + - **config_name** ([`str`]) -- The configuration filename that stores the class and module names of all the + diffusion pipeline's components. """ config_name = "model_index.json" @@ -143,10 +142,9 @@ def register_modules(self, **kwargs): def save_pretrained(self, save_directory: Union[str, os.PathLike], params: Union[Dict, FrozenDict]): # TODO: handle inference_state """ - Save all variables of the pipeline that can be saved and loaded as well as the pipelines configuration file to - a directory. A pipeline variable can be saved and loaded if its class implements both a save and loading - method. The pipeline can easily be re-loaded using the `[`~FlaxDiffusionPipeline.from_pretrained`]` class - method. + Save all saveable variables of the pipeline to a directory. A pipeline variable can be saved and loaded if its + class implements both a save and loading method. The pipeline is easily reloaded using the + [`~FlaxDiffusionPipeline.from_pretrained`] class method. Arguments: save_directory (`str` or `os.PathLike`): @@ -193,70 +191,61 @@ def save_pretrained(self, save_directory: Union[str, os.PathLike], params: Union @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], **kwargs): r""" - Instantiate a Flax diffusion pipeline from pre-trained pipeline weights. + Instantiate a Flax diffusion pipeline from pretrained pipeline weights. - The pipeline is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated). + The pipeline is set in evaluation mode - `model.eval()` - by default, and dropout modules are deactivated. - The warning *Weights from XXX not initialized from pretrained model* means that the weights of XXX do not come - pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning - task. + If you get the error message below, you need to finetune the weights for your downstream task: - The warning *Weights from XXX not used in YYY* means that the layer XXX is not used by YYY, therefore those - weights are discarded. + ``` + Some weights of FlaxUNet2DConditionModel were not initialized from the model checkpoint at runwayml/stable-diffusion-v1-5 and are newly initialized because the shapes did not match: + ``` Parameters: pretrained_model_name_or_path (`str` or `os.PathLike`, *optional*): Can be either: - - A string, the *repo id* of a pretrained pipeline hosted inside a model repo on - https://huggingface.co/ Valid repo ids have to be located under a user or organization name, like - `CompVis/ldm-text2im-large-256`. - - A path to a *directory* containing pipeline weights saved using - [`~FlaxDiffusionPipeline.save_pretrained`], e.g., `./my_pipeline_directory/`. + - A string, the *repo id* (for example `runwayml/stable-diffusion-v1-5`) of a pretrained pipeline + hosted on the Hub. + - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved + using [`~FlaxDiffusionPipeline.save_pretrained`]. dtype (`str` or `jnp.dtype`, *optional*): - Override the default `jnp.dtype` and load the model under this dtype. If `"auto"` is passed the dtype - will be automatically derived from the model's weights. + Override the default `jnp.dtype` and load the model under this dtype. 
If `"auto"`, the dtype is + automatically derived from the model's weights. force_download (`bool`, *optional*, defaults to `False`): Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist. resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to delete incompletely received files. Will attempt to resume the download if such a - file exists. + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. output_loading_info(`bool`, *optional*, defaults to `False`): Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages. - local_files_only(`bool`, *optional*, defaults to `False`): - Whether or not to only look at local files (i.e., do not try to download the model). + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to `True`, the model + won't be downloaded from the Hub. use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated - when running `huggingface-cli login` (stored in `~/.huggingface`). + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in ~/.huggingface) is used. revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a - git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any - identifier allowed by git. + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. mirror (`str`, *optional*): - Mirror source to accelerate downloads in China. If you are from China and have an accessibility - problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. - Please refer to the mirror site for more information. specify the folder name here. - + Mirror source to resolve accessibility issues if you're downloading a model in China. We do not + guarantee the timeliness or safety of the source, and you should refer to the mirror site for more + information. kwargs (remaining dictionary of keyword arguments, *optional*): - Can be used to overwrite load - and saveable variables - *i.e.* the pipeline components - of the - specific pipeline class. The overwritten components are then directly passed to the pipelines - `__init__` method. See example below for more information. - - - - It is required to be logged in (`huggingface-cli login`) when you want to use private or [gated - models](https://huggingface.co/docs/hub/models-gated#gated-models), *e.g.* `"runwayml/stable-diffusion-v1-5"` - - + Can be used to overwrite load and saveable variables (the pipeline components) of the specific pipeline + class. The overwritten components are passed directly to the pipelines `__init__` method. 
- Activate the special ["offline-mode"](https://huggingface.co/diffusers/installation.html#offline-mode) to use - this method in a firewalled environment. + To use private or [gated models](https://huggingface.co/docs/hub/models-gated#gated-models), log-in with + `huggingface-cli login`. You can also activate the special + [“offline-mode”](https://huggingface.co/diffusers/installation.html#offline-mode) to use this method in a + firewalled environment. @@ -540,7 +529,7 @@ def components(self) -> Dict[str, Any]: @staticmethod def numpy_to_pil(images): """ - Convert a numpy image or a batch of images to a PIL image. + Convert a NumPy image or a batch of images to a PIL image. """ if images.ndim == 3: images = images[None, ...] diff --git a/src/diffusers/pipelines/stable_diffusion/__init__.py b/src/diffusers/pipelines/stable_diffusion/__init__.py index 33ab05a1dacb..1219f44dc8eb 100644 --- a/src/diffusers/pipelines/stable_diffusion/__init__.py +++ b/src/diffusers/pipelines/stable_diffusion/__init__.py @@ -25,11 +25,11 @@ class StableDiffusionPipelineOutput(BaseOutput): Args: images (`List[PIL.Image.Image]` or `np.ndarray`) - List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width, - num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline. + List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, + num_channels)`. nsfw_content_detected (`List[bool]`) - List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, or `None` if safety checking could not be performed. + List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or + `None` if safety checking could not be performed. """ images: Union[List[PIL.Image.Image], np.ndarray] @@ -116,14 +116,14 @@ class StableDiffusionPipelineOutput(BaseOutput): @flax.struct.dataclass class FlaxStableDiffusionPipelineOutput(BaseOutput): """ - Output class for Stable Diffusion pipelines. + Output class for Flax-based Stable Diffusion pipelines. Args: images (`np.ndarray`) - Array of shape `(batch_size, height, width, num_channels)` with images from the diffusion pipeline. + Denoised images of array shape of `(batch_size, height, width, num_channels)`. nsfw_content_detected (`List[bool]`) - List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content. + List indicating whether the corresponding generated image contains "not-safe-for-work" + (nsfw) content or `None` if safety checking could not be performed. """ images: np.ndarray diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py index 40919cf40cb0..2eea5730999d 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py @@ -93,7 +93,7 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline): tokenizer (`CLIPTokenizer`): A [`~transformers.CLIPTokenizer`] to tokenize text. unet ([`FlaxUNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. 
Can be one of [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or @@ -327,8 +327,8 @@ def __call__( Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds` - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py index 6a387af364b7..db5c97b14ffd 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py @@ -104,31 +104,29 @@ class FlaxStableDiffusionImg2ImgPipeline(FlaxDiffusionPipeline): r""" - Pipeline for image-to-image generation using Stable Diffusion. + Pipeline for text-guided image-to-image generation using Stable Diffusion. - This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`FlaxAutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. - text_encoder ([`FlaxCLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.FlaxCLIPTextModel), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. + text_encoder ([`~transformers.FlaxCLIPTextModel`]): + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`FlaxUNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`FlaxUNet2DConditionModel`]): + A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or - [`FlaxDPMSolverMultistepScheduler`]. + `FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], or [`FlaxPNDMScheduler`]. safety_checker ([`FlaxStableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. 
- Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -353,52 +351,58 @@ def __call__( jit: bool = False, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt_ids (`jnp.array`): - The prompt or prompts to guide the image generation. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`jnp.array`): - Array representing an image batch, that will be used as the starting point for the process. - params (`Dict` or `FrozenDict`): Dictionary containing the model parameters/weights - prng_seed (`jax.random.KeyArray` or `jax.Array`): Array containing random number generator key + Array representing an image batch to be used as the starting point. + params (`Dict` or `FrozenDict`): + Dictionary containing the model parameters/weights + prng_seed (`jax.random.KeyArray` or `jax.Array`): + Array containing random number generator key strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. + expense of slower inference. This parameter is modulated by `strength`. height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. 
noise (`jnp.array`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image - generation. Can be used to tweak the same generation with different prompts. tensor will ge generated - by sampling using the supplied random `generator`. + Pre-generated noisy latents sampled from a Gaussian distribution to be used as inputs for image + generation. Can be used to tweak the same generation with different prompts. The array is generated by + sampling using the supplied random `generator`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead of a plain tuple. jit (`bool`, defaults to `False`): - Whether to run `pmap` versions of the generation and safety scoring functions. NOTE: This argument - exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a future release. + Whether to run `pmap` versions of the generation and safety scoring functions. + + + + This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a + future release. + + Examples: Returns: [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a - `tuple. When returning a tuple, the first element is a list with the generated images, and the second - element is a list of `bool`s denoting whether the corresponding generated image likely represents - "not-safe-for-work" (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is a list with the generated images + and the second element is a list of `bool`s indicating whether the corresponding generated image + contains "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py index abb57f8b62e9..3cdac8e29885 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py @@ -101,31 +101,35 @@ class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline): r""" - Pipeline for text-guided image inpainting using Stable Diffusion. *This is an experimental feature*. + Pipeline for text-guided image inpainting using Stable Diffusion. - This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + + + 🧪 This is an experimental feature! + + + + This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`FlaxAutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. - text_encoder ([`FlaxCLIPTextModel`]): - Frozen text-encoder. 
Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.FlaxCLIPTextModel), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. + text_encoder ([`~transformers.FlaxCLIPTextModel`]): + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`FlaxUNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`FlaxUNet2DConditionModel`]): + A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or - [`FlaxDPMSolverMultistepScheduler`]. + `FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], or [`FlaxPNDMScheduler`]. safety_checker ([`FlaxStableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -408,27 +412,31 @@ def __call__( Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. latents (`jnp.array`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image - generation. 
Can be used to tweak the same generation with different prompts. tensor will ge generated - by sampling using the supplied random `generator`. + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image + generation. Can be used to tweak the same generation with different prompts. If not provided, a latents + tensor is generated by sampling using the supplied random `generator`. jit (`bool`, defaults to `False`): - Whether to run `pmap` versions of the generation and safety scoring functions. NOTE: This argument - exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a future release. + Whether to run `pmap` versions of the generation and safety scoring functions. + + + + This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a + future release. + + + return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead of a plain tuple. @@ -437,10 +445,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a - `tuple. When returning a tuple, the first element is a list with the generated images, and the second - element is a list of `bool`s denoting whether the corresponding generated image likely represents - "not-safe-for-work" (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is a list with the generated images + and the second element is a list of `bool`s indicating whether the corresponding generated image + contains "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py index 1c31670a80c8..4baa3c51e8a7 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py @@ -539,8 +539,8 @@ def __call__( Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds` - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. 
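To make the `self.unet.config.sample_size * self.vae_scale_factor` default tangible, here is a small sketch of where the familiar 512-pixel default comes from; the `sample_size` and `block_out_channels` values are the ones commonly shipped with the v1-5 checkpoint and are quoted from memory:

```py
# Assumed config values for a typical Stable Diffusion v1-5 checkpoint.
unet_sample_size = 64                          # latent-space resolution of the UNet
vae_block_out_channels = [128, 256, 512, 512]  # VAE encoder blocks

# The VAE halves the spatial resolution once per block after the first one.
vae_scale_factor = 2 ** (len(vae_block_out_channels) - 1)  # -> 8

default_height = default_width = unet_sample_size * vae_scale_factor
print(default_height, default_width)  # 512 512
```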
diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py index cae0f3a347de..6cc46d3ec11e 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py @@ -64,32 +64,28 @@ def preprocess(image): class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for text-guided image to image generation using Stable Diffusion. + Pipeline for text-guided depth-based image-to-image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading and saving methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] + - [`~loaders.LoraLoaderMixin.load_lora_weights`] + - [`~loaders.LoraLoaderMixin.save_lora_weights`] Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. """ def __init__( @@ -521,68 +517,60 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. 
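A short usage sketch may help ground the arguments documented below; the checkpoint name and image URL are examples only, and the depth map is predicted automatically because `depth_map` is not passed:

```py
import requests
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The predicted depth of `init_image` conditions the generation in addition to the prompt.
image = pipe(
    prompt="two tigers",
    image=init_image,
    negative_prompt="bad, deformed, ugly, bad anatomy",
    strength=0.7,
).images[0]
image.save("depth2img.png")
```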
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch, that will be used as the starting point for the - process. Can accept image latents as `image` only if `depth_map` is not `None`. + `Image` or tensor representing an image batch to be used as the starting point. Can accept image + latents as `image` only if `depth_map` is not `None`. depth_map (`torch.FloatTensor`, *optional*): - depth prediction that will be used as additional conditioning for the image generation process. If not - defined, it will automatically predicts the depth via `self.depth_estimator`. + Depth prediction to be used as additional conditioning for the image generation process. If not + defined, it automatically predicts the depth with `self.depth_estimator`. strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in - `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. This parameter will be modulated by `strength`. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. 
Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: @@ -609,10 +597,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. 
- When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs self.check_inputs( diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py index ebcb55a7cff9..efe065d0e2cc 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py @@ -36,27 +36,27 @@ class StableDiffusionImageVariationPipeline(DiffusionPipeline): r""" - Pipeline to generate variations from an input image using Stable Diffusion. + Pipeline to generate image variations from an input image using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. image_encoder ([`CLIPVisionModelWithProjection`]): - Frozen CLIP image-encoder. Stable Diffusion Image Variation uses the vision portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. 
+ A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ # TODO: feature_extractor is required to encode images (if they are in PIL format), # we should give a descriptive message if the pipeline doesn't have one. @@ -253,58 +253,74 @@ def __call__( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`): - The image or images to guide the image generation. If you provide a tensor, it needs to comply with the - configuration of - [this](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json) - `CLIPImageProcessor` + Image or images to guide image generation. If you provide a tensor, it needs to be compatible with + [`CLIPImageProcessor`](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json). height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image. width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. 
If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. + + Examples: + + ```py + from diffusers import StableDiffusionImageVariationPipeline + from PIL import Image + from io import BytesIO + import requests + + pipe = StableDiffusionImageVariationPipeline.from_pretrained( + "lambdalabs/sd-image-variations-diffusers", revision="v2.0" + ) + pipe = pipe.to("cuda") + + url = "https://lh3.googleusercontent.com/y-iFOHfLTwkuQSUegpwDdgKmOjRSTvPxat63dQLB25xkTs4lhIbRUFeNBWZzYf370g=s1200" + + response = requests.get(url) + image = Image.open(BytesIO(response.content)).convert("RGB") + + out = pipe(image, num_images_per_prompt=3, guidance_scale=15) + out["images"][0].save("result.jpg") + ``` """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py index e4c5928c08a4..cc89824772e8 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py @@ -102,38 +102,34 @@ class StableDiffusionImg2ImgPipeline( DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin ): r""" - Pipeline for text-guided image to image generation using Stable Diffusion. 
+ Pipeline for text-guided image-to-image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] - Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -593,74 +589,66 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. 
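For orientation, a minimal image-to-image call might look like the following; the checkpoint name and input URL are examples, and the comment shows how `strength` modulates the effective number of denoising steps:

```py
import requests
import torch
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((768, 512))

# With strength=0.75 and num_inference_steps=50, roughly int(50 * 0.75) = 37
# denoising steps actually run on top of the noised `init_image`.
image = pipe(
    prompt="A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
image.save("fantasy_landscape.png")
```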
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch, that will be used as the starting point for the - process. Can also accpet image latents as `image`, if passing latents directly, it will not be encoded - again. + `Image` or tensor representing an image batch to be used as the starting point. Can also accept image + latents as `image`, if passing latents directly, it will not be encoded again. strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in - `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. This parameter will be modulated by `strength`. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. 
Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. 
+ If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs. Raise error if not correct self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py index 11cd7851c754..f5624fc0e63a 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py @@ -158,45 +158,32 @@ class StableDiffusionInpaintPipeline( r""" Pipeline for text-guided image inpainting using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] - - - - It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such - as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default - text-to-image stable diffusion checkpoints, such as - [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible with - this pipeline, but might be less performant. - - + The pipeline also inherits the following loading and saving methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] + - [`~loaders.LoraLoaderMixin.load_lora_weights`] + - [`~loaders.LoraLoaderMixin.save_lora_weights`] Args: vae ([`AutoencoderKL`, `AsymmetricAutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. 
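To make the component list above concrete, a minimal inpainting call could look like the sketch below; the inpainting-specific checkpoint and the image/mask URLs are examples only:

```py
import requests
import torch
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline


def download_image(url):
    return Image.open(BytesIO(requests.get(url).content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# White pixels in `mask_image` are repainted according to `prompt`;
# black pixels are preserved from `init_image`.
image = pipe(
    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
image.save("inpainted.png")
```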
scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -706,79 +693,71 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`PIL.Image.Image`): - `Image`, or tensor representing an image batch which will be inpainted, *i.e.* parts of the image will - be masked out with `mask_image` and repainted according to `prompt`. + `Image` or tensor representing an image batch to be inpainted (which parts of the image to be masked + out with `mask_image` and repainted according to `prompt`). mask_image (`PIL.Image.Image`): - `Image`, or tensor representing an image batch, to mask `image`. White pixels in the mask will be - repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted - to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L) - instead of 3, so the expected shape would be `(B, H, W, 1)`. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + `Image` or tensor representing an image batch to mask `image`. White pixels in the mask are repainted + while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a single channel + (luminance) before use. If it's a tensor, it should contain one color channel (L) instead of 3, so the + expected shape would be `(B, H, W, 1)`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. - strength (`float`, *optional*, defaults to 1.): - Conceptually, indicates how much to transform the masked portion of the reference `image`. Must be - between 0 and 1. `image` will be used as a starting point, adding more noise to it the larger the - `strength`. The number of denoising steps depends on the amount of noise initially added. 
When - `strength` is 1, added noise will be maximum and the denoising process will run for the full number of - iterations specified in `num_inference_steps`. A value of 1, therefore, essentially ignores the masked - portion of the reference `image`. + strength (`float`, *optional*, defaults to 0.8): + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. 
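`generator` and `latents` are the two reproducibility hooks documented above. A small sketch of seeding a generation follows; it uses the base text-to-image pipeline only because it needs no input image, and the same `generator` argument applies to this pipeline as well:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

# Re-creating the generator with the same seed makes repeated calls reproducible
# (on the same hardware and library versions).
first = pipe(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
second = pipe(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
```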
prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + Examples: ```py @@ -812,10 +791,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. 
Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py index 9d5022c854a8..8692bbdd82dc 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py @@ -60,25 +60,23 @@ def preprocess(image): class StableDiffusionLatentUpscalePipeline(DiffusionPipeline): r""" - Pipeline to upscale the resolution of Stable Diffusion output images by a factor of 2. + Pipeline for upscaling Stable Diffusion output image resolution by a factor of 2. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`EulerDiscreteScheduler`]. + `EulerDiscreteScheduler`]. """ def __init__( @@ -279,50 +277,46 @@ def __call__( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image upscaling. + The prompt or prompts to guide image upscaling. image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch which will be upscaled. If it's a tensor, it can be - either a latent output from a stable diffusion model, or an image tensor in the range `[-1, 1]`. It - will be considered a `latent` if `image.shape[1]` is `4`; otherwise, it will be considered to be an - image representation and encoded using this pipeline's `vae` encoder. + `Image` or tensor representing an image batch to be upscaled. If it's a tensor, it can be either a + latent output from a Stable Diffusion model or an image tensor in the range `[-1, 1]`. 
It is considered + a `latent` if `image.shape[1]` is `4`; otherwise, it is considered to be an image representation and + encoded using this pipeline's `vae` encoder. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. 
The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: ```py @@ -363,10 +357,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py index 280779cd8539..a7911f52710f 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py @@ -62,11 +62,11 @@ class LDM3DPipelineOutput(BaseOutput): Args: images (`List[PIL.Image.Image]` or `np.ndarray`) - List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width, - num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline. + List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, + num_channels)`. nsfw_content_detected (`List[bool]`) - List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, or `None` if safety checking could not be performed. + List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or + `None` if safety checking could not be performed. """ rgb: Union[List[PIL.Image.Image], np.ndarray] @@ -78,11 +78,10 @@ class StableDiffusionLDM3DPipeline( DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin ): r""" - Pipeline for text-to-image and 3d generation using LDM3D. LDM3D: Latent Diffusion Model for 3D: - https://arxiv.org/abs/2305.10853 + Pipeline for text-to-image and 3D generation using LDM3D. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). 
In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] @@ -94,24 +93,22 @@ class StableDiffusionLDM3DPipeline( Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode rgb and depth images to and from latent - representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded rgb and depth latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -495,73 +492,65 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. 
- guidance_scale (`float`, *optional*, defaults to 5.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + guidance_scale (`float`, *optional*, defaults to 5.0): + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image.
Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py index 61b1419d5ced..81886ebd6bb5 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py @@ -69,20 +69,18 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi r""" Pipeline for text-guided image super-resolution using Stable Diffusion 2. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). 
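A minimal usage sketch for this text-guided super-resolution pipeline follows; the `stabilityai/stable-diffusion-x4-upscaler` checkpoint, the sample image URL, and the use of a CUDA device are assumptions for illustration only.

```py
# Hypothetical usage sketch for text-guided super-resolution.
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

# Assumed checkpoint; any Stable Diffusion x4 upscaler checkpoint should work here.
pipeline = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# Assumed low-resolution input image; replace with your own.
low_res_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
).resize((128, 128))

# The text prompt guides the upscaled result.
upscaled_image = pipeline(prompt="a white cat", image=low_res_image).images[0]
upscaled_image.save("upsampled_cat.png")
```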
Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. low_res_scheduler ([`SchedulerMixin`]): A scheduler used to add initial noise to the low res conditioning image. It must be an instance of [`DDPMScheduler`]. @@ -142,10 +140,10 @@ def __init__( def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -513,62 +511,57 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch which will be upscaled. * + `Image` or tensor representing an image batch to be upscaled. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
- `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. 
return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: ```py @@ -598,10 +591,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs diff --git a/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py b/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py index f172575bc6c7..de28a0aebfa3 100644 --- a/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py +++ b/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py @@ -21,32 +21,29 @@ class StableDiffusionPipelineSafe(DiffusionPipeline): r""" - Pipeline for text-to-image generation using Safe Latent Diffusion. + Pipeline based on the [`StableDiffusionPipeline`] for text-to-image generation using Safe Latent Diffusion. - The implementation is based on the [`StableDiffusionPipeline`] - - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). 
Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -489,12 +486,12 @@ def __call__( sld_mom_beta: Optional[float] = 0.4, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image. @@ -502,65 +499,69 @@ The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality.
+ A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. Ignored when not using guidance + (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. sld_guidance_scale (`float`, *optional*, defaults to 1000): - Safe latent guidance as defined in [Safe Latent Diffusion](https://arxiv.org/abs/2211.05105). - `sld_guidance_scale` is defined as sS of Eq. 6. If set to be less than 1, safety guidance will be - disabled. + If `sld_guidance_scale < 1`, safety guidance is disabled.
sld_warmup_steps (`int`, *optional*, defaults to 10): - Number of warmup steps for safety guidance. SLD will only be applied for diffusion steps greater than - `sld_warmup_steps`. `sld_warmup_steps` is defined as `delta` of [Safe Latent - Diffusion](https://arxiv.org/abs/2211.05105). + Number of warmup steps for safety guidance. SLD is only applied for diffusion steps greater than + `sld_warmup_steps`. sld_threshold (`float`, *optional*, defaults to 0.01): - Threshold that separates the hyperplane between appropriate and inappropriate images. `sld_threshold` - is defined as `lamda` of Eq. 5 in [Safe Latent Diffusion](https://arxiv.org/abs/2211.05105). + Threshold that separates the hyperplane between appropriate and inappropriate images. sld_momentum_scale (`float`, *optional*, defaults to 0.3): - Scale of the SLD momentum to be added to the safety guidance at each diffusion step. If set to 0.0 - momentum will be disabled. Momentum is already built up during warmup, i.e. for diffusion steps smaller - than `sld_warmup_steps`. `sld_momentum_scale` is defined as `sm` of Eq. 7 in [Safe Latent - Diffusion](https://arxiv.org/abs/2211.05105). + Scale of the SLD momentum to be added to the safety guidance at each diffusion step. If set to 0.0, + momentum is disabled. Momentum is built up during warmup for diffusion steps smaller than + `sld_warmup_steps`. sld_mom_beta (`float`, *optional*, defaults to 0.4): Defines how safety guidance momentum builds up. `sld_mom_beta` indicates how much of the previous - momentum will be kept. Momentum is already built up during warmup, i.e. for diffusion steps smaller - than `sld_warmup_steps`. `sld_mom_beta` is defined as `beta m` of Eq. 8 in [Safe Latent - Diffusion](https://arxiv.org/abs/2211.05105). + momentum is kept. Momentum is built up during warmup for diffusion steps smaller than + `sld_warmup_steps`. + Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. + + Examples: + + ```py + import torch + from diffusers import StableDiffusionPipelineSafe + from diffusers.pipelines.stable_diffusion_safe import SafetyConfig + + pipeline = StableDiffusionPipelineSafe.from_pretrained( + "AIML-TUDA/stable-diffusion-safe", torch_dtype=torch.float16 + ) + prompt = "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c. leyendecker" + image = pipeline(prompt=prompt, **SafetyConfig.MEDIUM).images[0] + ``` """ # 0.
Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor From 3a01e00394b086d5b5c3c5910925291c3535a87d Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 30 Jun 2023 16:31:09 -0700 Subject: [PATCH 04/13] fix path to pipeline output --- docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx | 4 +++- .../en/api/pipelines/stable_diffusion/image_variation.mdx | 2 +- docs/source/en/api/pipelines/stable_diffusion/img2img.mdx | 2 +- docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx | 2 +- .../en/api/pipelines/stable_diffusion/latent_upscale.mdx | 4 ++-- docs/source/en/api/pipelines/stable_diffusion/text2img.mdx | 2 +- docs/source/en/api/pipelines/stable_diffusion/upscale.mdx | 2 +- 7 files changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx index 173a8facfb6e..7493005b0a1f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx @@ -29,4 +29,6 @@ The original codebase can be found at [Stability-AI/stablediffusion](https://git - load_lora_weights - save_lora_weights -## StableDiffusionPipelineOutput \ No newline at end of file +## StableDiffusionPipelineOutput + +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx index 831d16f1317f..38e02fa6652e 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx @@ -28,4 +28,4 @@ The original codebase can be found at [Stable Diffusion Image Variations](https: ## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx index d99c4535ba29..9fb3467a6bec 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx @@ -36,7 +36,7 @@ The abstract from the paper is: ## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ## FlaxStableDiffusionImg2ImgPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx index 44e2e5c464ac..81cb7e73a7cd 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx @@ -46,7 +46,7 @@ this pipeline but might be less performant. 
## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ## FlaxStableDiffusionInpaintPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx index 55aa603f9e6a..9f91eb97ce5e 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx @@ -27,6 +27,6 @@ The [Stable Diffusion Upscaler Demo](https://colab.research.google.com/drive/1o1 - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention -# StableDiffusionPipelineOutput +## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 05b50d8595ce..fc8a501648a6 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -46,7 +46,7 @@ Additional official checkpoints for different versions of the Stable Diffusion m ## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ## FlaxStableDiffusionPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx index 394054a9ca8b..82038fe36195 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx @@ -28,4 +28,4 @@ The original codebase can be found at [Stability-AI/stablediffusion](https://git ## StableDiffusionPipelineOutput -[[autodoc]] StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file From 184821c399207308dcf176c958bc50114aa260be Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 30 Jun 2023 16:45:20 -0700 Subject: [PATCH 05/13] fix flax paths --- docs/source/en/api/pipelines/stable_diffusion/img2img.mdx | 2 +- docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx | 2 +- docs/source/en/api/pipelines/stable_diffusion/text2img.mdx | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx index 9fb3467a6bec..85c167671204 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx @@ -46,4 +46,4 @@ The abstract from the paper is: ## FlaxStableDiffusionPipelineOutput -[[autodoc]] FlaxStableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx index 81cb7e73a7cd..c45c5d9c80a5 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx @@ -56,4 +56,4 @@ this pipeline but might be less performant. 
## FlaxStableDiffusionPipelineOutput -[[autodoc]] FlaxStableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index fc8a501648a6..1191be418315 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -56,4 +56,4 @@ Additional official checkpoints for different versions of the Stable Diffusion m ## FlaxStableDiffusionPipelineOutput -[[autodoc]] FlaxStableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput \ No newline at end of file From 799868b7c92e37c6dd26873bfc5a47a24768f3f0 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 6 Jul 2023 09:56:18 -0700 Subject: [PATCH 06/13] fix copies --- .../pipeline_text_to_video_synth_img2img.py | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py index 72a5b762d504..c8745d79c58f 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py @@ -178,17 +178,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -196,17 +194,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. 
If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() From b7c573d73639fbd3cfda2c92eb430118bcd78cd0 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 6 Jul 2023 17:29:04 -0700 Subject: [PATCH 07/13] add up to score sde ve --- .../source/en/api/pipelines/alt_diffusion.mdx | 66 +---- .../en/api/pipelines/attend_and_excite.mdx | 62 +---- .../en/api/pipelines/audio_diffusion.mdx | 83 +----- docs/source/en/api/pipelines/audioldm.mdx | 65 +---- .../en/api/pipelines/consistency_models.mdx | 92 ++---- docs/source/en/api/pipelines/controlnet.mdx | 25 +- .../en/api/pipelines/cycle_diffusion.mdx | 83 +----- .../en/api/pipelines/dance_diffusion.mdx | 19 +- docs/source/en/api/pipelines/ddim.mdx | 21 +- docs/source/en/api/pipelines/ddpm.mdx | 22 +- docs/source/en/api/pipelines/diffedit.mdx | 24 +- docs/source/en/api/pipelines/dit.mdx | 44 +-- .../en/api/pipelines/latent_diffusion.mdx | 27 +- .../en/api/pipelines/paint_by_example.mdx | 57 +--- docs/source/en/api/pipelines/panorama.mdx | 54 +--- docs/source/en/api/pipelines/paradigms.mdx | 65 +---- docs/source/en/api/pipelines/pix2pix.mdx | 55 +--- docs/source/en/api/pipelines/pix2pix_zero.mdx | 15 +- docs/source/en/api/pipelines/pndm.mdx | 20 +- docs/source/en/api/pipelines/repaint.mdx | 63 +---- docs/source/en/api/pipelines/score_sde_ve.mdx | 21 +- .../pipelines/stable_diffusion/depth2img.mdx | 2 +- .../pipelines/stable_diffusion/inpaint.mdx | 2 + .../stable_diffusion/ldm3d_diffusion.mdx | 13 - .../pipelines/stable_diffusion/overview.mdx | 54 ++-- .../stable_diffusion/stable_diffusion_2.mdx | 2 +- .../pipelines/stable_diffusion/text2img.mdx | 8 - .../pipelines/audio_diffusion/mel.py | 57 ++-- .../pipeline_audio_diffusion.py | 139 ++++++++-- .../pipelines/audioldm/pipeline_audioldm.py | 93 +++---- .../pipeline_consistency_models.py | 51 ++-- .../pipeline_dance_diffusion.py | 49 +++- src/diffusers/pipelines/ddim/pipeline_ddim.py | 55 +++- src/diffusers/pipelines/ddpm/pipeline_ddpm.py | 41 ++- src/diffusers/pipelines/dit/pipeline_dit.py | 63 ++++- .../pipeline_latent_diffusion.py | 68 +++-- ...peline_latent_diffusion_superresolution.py | 66 +++-- .../pipeline_paint_by_example.py | 130 ++++++--- src/diffusers/pipelines/pndm/pipeline_pndm.py | 51 +++- .../pipelines/repaint/pipeline_repaint.py | 81 +++++- .../score_sde_ve/pipeline_score_sde_ve.py | 39 ++- .../pipeline_cycle_diffusion.py | 166 +++++++---- ...line_stable_diffusion_attend_and_excite.py | 98 +++---- .../pipeline_stable_diffusion_diffedit.py | 261 ++++++++---------- ...eline_stable_diffusion_instruct_pix2pix.py | 94 +++---- .../pipeline_stable_diffusion_ldm3d.py | 2 + .../pipeline_stable_diffusion_panorama.py | 99 ++++--- .../pipeline_stable_diffusion_paradigms.py | 102 +++---- 48 files changed, 1303 insertions(+), 1566 deletions(-) diff --git a/docs/source/en/api/pipelines/alt_diffusion.mdx b/docs/source/en/api/pipelines/alt_diffusion.mdx index 8463fd51ddbb..d5f7031d59f9 100644 --- a/docs/source/en/api/pipelines/alt_diffusion.mdx +++ b/docs/source/en/api/pipelines/alt_diffusion.mdx @@ -12,72 +12,26 @@ specific language governing permissions and limitations under the License. 
# AltDiffusion -AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu. +AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://huggingface.co/papers/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu. -The abstract of the paper is the following: +The abstract from the paper is: *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.* - -*Overview*: - -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [pipeline_alt_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py) | *Text-to-Image Generation* | - | - -| [pipeline_alt_diffusion_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py) | *Image-to-Image Text-Guided Generation* | - |- - -## Tips - -- AltDiffusion is conceptually exactly the same as [Stable Diffusion](./stable_diffusion/overview). - -- *Run AltDiffusion* - -AltDiffusion can be tested very easily with the [`AltDiffusionPipeline`], [`AltDiffusionImg2ImgPipeline`] and the `"BAAI/AltDiffusion-m9"` checkpoint exactly in the same way it is shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation) and the [Image-to-Image Generation Guide](../../using-diffusers/img2img). - -- *How to load and use different schedulers.* - -The alt diffusion pipeline uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the alt diffusion pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc. -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. 
For example, to use the [`EulerDiscreteScheduler`], you can do the following: - -```python ->>> from diffusers import AltDiffusionPipeline, EulerDiscreteScheduler - ->>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9") ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) - ->>> # or ->>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("BAAI/AltDiffusion-m9", subfolder="scheduler") ->>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", scheduler=euler_scheduler) -``` - - -- *How to convert all use cases with multiple or single pipeline* - -If you want to use all possible use cases in a single `DiffusionPipeline` we recommend using the `components` functionality to instantiate all components in the most memory-efficient way: - -```python ->>> from diffusers import ( -... AltDiffusionPipeline, -... AltDiffusionImg2ImgPipeline, -... ) - ->>> text2img = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9") ->>> img2img = AltDiffusionImg2ImgPipeline(**text2img.components) - ->>> # now you can use text2img(...) and img2img(...) just like the call methods of each respective pipeline -``` - -## AltDiffusionPipelineOutput -[[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput - - all - - __call__ - ## AltDiffusionPipeline + [[autodoc]] AltDiffusionPipeline - all - __call__ ## AltDiffusionImg2ImgPipeline + [[autodoc]] AltDiffusionImg2ImgPipeline - all - __call__ + +## AltDiffusionPipelineOutput + +[[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput + - all + - __call__ \ No newline at end of file diff --git a/docs/source/en/api/pipelines/attend_and_excite.mdx b/docs/source/en/api/pipelines/attend_and_excite.mdx index 1a329bc442e7..3c9fde50ca35 100644 --- a/docs/source/en/api/pipelines/attend_and_excite.mdx +++ b/docs/source/en/api/pipelines/attend_and_excite.mdx @@ -10,66 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models +# Attend and Excite -## Overview +Attend and Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation. -Attend and Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over the image generation. - -The abstract of the paper is the following: +The abstract from the paper is: *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. 
We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* -Resources - -* [Project Page](https://attendandexcite.github.io/Attend-and-Excite/) -* [Paper](https://arxiv.org/abs/2301.13826) -* [Original Code](https://github.com/AttendAndExcite/Attend-and-Excite) -* [Demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite) - - -## Available Pipelines: - -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [pipeline_semantic_stable_diffusion_attend_and_excite.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_semantic_stable_diffusion_attend_and_excite) | *Text-to-Image Generation* | - | https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite - - -### Usage example - - -```python -import torch -from diffusers import StableDiffusionAttendAndExcitePipeline - -model_id = "CompVis/stable-diffusion-v1-4" -pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") -pipe = pipe.to("cuda") - -prompt = "a cat and a frog" - -# use get_indices function to find out indices of the tokens you want to alter -pipe.get_indices(prompt) - -token_indices = [2, 5] -seed = 6141 -generator = torch.Generator("cuda").manual_seed(seed) - -images = pipe( - prompt=prompt, - token_indices=token_indices, - guidance_scale=7.5, - generator=generator, - num_inference_steps=50, - max_iter_to_alter=25, -).images - -image = images[0] -image.save(f"../images/{prompt}_{seed}.png") -``` - +You can find additional information about Attend and Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), [paper](https://arxiv.org/abs/2301.13826), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). ## StableDiffusionAttendAndExcitePipeline + [[autodoc]] StableDiffusionAttendAndExcitePipeline - all - __call__ + +## StableDiffusionPipelineOutput + +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/audio_diffusion.mdx b/docs/source/en/api/pipelines/audio_diffusion.mdx index b6d64c938060..20b97a80a733 100644 --- a/docs/source/en/api/pipelines/audio_diffusion.mdx +++ b/docs/source/en/api/pipelines/audio_diffusion.mdx @@ -12,87 +12,20 @@ specific language governing permissions and limitations under the License. # Audio Diffusion -## Overview +[Audio Diffusion](https://github.com/teticio/audio-diffusion) is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images. -[Audio Diffusion](https://github.com/teticio/audio-diffusion) by Robert Dargavel Smith. - -Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to -and from mel spectrogram images. - -The original codebase of this implementation can be found [here](https://github.com/teticio/audio-diffusion), including -training scripts and example notebooks. 
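For quick reference, a minimal sketch of unconditional audio generation, condensed from the notebook-style examples removed below (the `teticio/audio-diffusion-256` checkpoint is one of the checkpoints referenced there):

```python
import torch
from diffusers import DiffusionPipeline

# Load an unconditional audio diffusion checkpoint (teticio/audio-diffusion-256 is assumed here)
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

# The pipeline returns the mel spectrogram image alongside the decoded audio sample
output = pipe()
image = output.images[0]
audio = output.audios[0]
sample_rate = pipe.mel.get_sample_rate()
```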
- -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_audio_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py) | *Unconditional Audio Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb) | - - -## Examples: - -### Audio Diffusion - -```python -import torch -from IPython.display import Audio -from diffusers import DiffusionPipeline - -device = "cuda" if torch.cuda.is_available() else "cpu" -pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device) - -output = pipe() -display(output.images[0]) -display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) -``` - -### Latent Audio Diffusion - -```python -import torch -from IPython.display import Audio -from diffusers import DiffusionPipeline - -device = "cuda" if torch.cuda.is_available() else "cpu" -pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device) - -output = pipe() -display(output.images[0]) -display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) -``` - -### Audio Diffusion with DDIM (faster) - -```python -import torch -from IPython.display import Audio -from diffusers import DiffusionPipeline - -device = "cuda" if torch.cuda.is_available() else "cpu" -pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device) - -output = pipe() -display(output.images[0]) -display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) -``` - -### Variations, in-painting, out-painting etc. - -```python -output = pipe( - raw_audio=output.audios[0, 0], - start_step=int(pipe.get_default_steps() / 2), - mask_start_secs=1, - mask_end_secs=1, -) -display(output.images[0]) -display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) -``` +The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion). ## AudioDiffusionPipeline [[autodoc]] AudioDiffusionPipeline - all - __call__ +## AudioPipelineOutput +[[autodoc]] pipelines.AudioPipelineOutput + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput + ## Mel [[autodoc]] Mel diff --git a/docs/source/en/api/pipelines/audioldm.mdx b/docs/source/en/api/pipelines/audioldm.mdx index 25a5bb8bce13..36120b9ec585 100644 --- a/docs/source/en/api/pipelines/audioldm.mdx +++ b/docs/source/en/api/pipelines/audioldm.mdx @@ -12,73 +12,32 @@ specific language governing permissions and limitations under the License. # AudioLDM -## Overview - -AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al. +AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. 
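For quick reference, a minimal text-to-audio sketch, condensed from the usage example removed below (it assumes the `cvssp/audioldm-s-full-v2` checkpoint and a CUDA device):

```python
import torch
import scipy
from diffusers import AudioLDMPipeline

# Load the AudioLDM checkpoint in half precision
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# A descriptive, context-specific prompt tends to work best
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# AudioLDM generates audio at a 16 kHz sampling rate
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```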
-This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM). - -## Text-to-Audio - -The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) and generate text-conditional audio outputs: - -```python -from diffusers import AudioLDMPipeline -import torch -import scipy +The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM), and the pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). -repo_id = "cvssp/audioldm-s-full-v2" -pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16) -pipe = pipe.to("cuda") +## Tips -prompt = "Techno music with a strong, upbeat tempo and high melodic riffs" -audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0] +When constructing a prompt, keep in mind: -# save the audio sample as a .wav file -scipy.io.wavfile.write("techno.wav", rate=16000, data=audio) -``` +* Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream"). +* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. -### Tips +During inference: -Prompts: -* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream"). -* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with. - -Inference: -* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference. +* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. -### How to load and use different schedulers - -The AudioLDM pipeline uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers -that can be used with the AudioLDM pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], -[`EulerAncestralDiscreteScheduler`] etc. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest -scheduler there is. - -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] -method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the -[`DPMSolverMultistepScheduler`], you can do the following: - -```python ->>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler ->>> import torch - ->>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16) ->>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) - ->>> # or ->>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm-s-full-v2", subfolder="scheduler") ->>> pipeline = AudioLDMPipeline.from_pretrained( -... 
"cvssp/audioldm-s-full-v2", scheduler=dpm_scheduler, torch_dtype=torch.float16 -... ) -``` - ## AudioLDMPipeline [[autodoc]] AudioLDMPipeline - all - __call__ + +## StableDiffusionPipelineOutput + +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/consistency_models.mdx b/docs/source/en/api/pipelines/consistency_models.mdx index 56ec2e0f3432..fa4d36102d03 100644 --- a/docs/source/en/api/pipelines/consistency_models.mdx +++ b/docs/source/en/api/pipelines/consistency_models.mdx @@ -1,87 +1,43 @@ # Consistency Models -Consistency Models were proposed in [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. +Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. -The abstract of the [paper](https://arxiv.org/pdf/2303.01469.pdf) is as follows: +The abstract from the [paper](https://arxiv.org/pdf/2303.01469.pdf) is: *Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. * -Resources: +The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai). -* [Paper](https://arxiv.org/abs/2303.01469) -* [Original Code](https://github.com/openai/consistency_models) +The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). 
❤️ -Available Checkpoints are: -- *cd_imagenet64_l2 (64x64 resolution)* [openai/consistency-model-pipelines](https://huggingface.co/openai/diffusers-cd_imagenet64_l2) -- *cd_imagenet64_lpips (64x64 resolution)* [openai/diffusers-cd_imagenet64_lpips](https://huggingface.co/openai/diffusers-cd_imagenet64_lpips) -- *ct_imagenet64 (64x64 resolution)* [openai/diffusers-ct_imagenet64](https://huggingface.co/openai/diffusers-ct_imagenet64) -- *cd_bedroom256_l2 (256x256 resolution)* [openai/diffusers-cd_bedroom256_l2](https://huggingface.co/openai/diffusers-cd_bedroom256_l2) -- *cd_bedroom256_lpips (256x256 resolution)* [openai/diffusers-cd_bedroom256_lpips](https://huggingface.co/openai/diffusers-cd_bedroom256_lpips) -- *ct_bedroom256 (256x256 resolution)* [openai/diffusers-ct_bedroom256](https://huggingface.co/openai/diffusers-ct_bedroom256) -- *cd_cat256_l2 (256x256 resolution)* [openai/diffusers-cd_cat256_l2](https://huggingface.co/openai/diffusers-cd_cat256_l2) -- *cd_cat256_lpips (256x256 resolution)* [openai/diffusers-cd_cat256_lpips](https://huggingface.co/openai/diffusers-cd_cat256_lpips) -- *ct_cat256 (256x256 resolution)* [openai/diffusers-ct_cat256](https://huggingface.co/openai/diffusers-ct_cat256) +## Tips -## Available Pipelines +For an additional speed-up, use `torch.compile` to generate multiple images in <1 second: -| Pipeline | Tasks | Demo | Colab | -|:---:|:---:|:---:|:---:| -| [ConsistencyModelPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_consistency_models.py) | *Unconditional Image Generation* | | | +```diff + import torch + from diffusers import ConsistencyModelPipeline -This pipeline was contributed by our community members [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues) ❤️ + device = "cuda" + # Load the cd_bedroom256_lpips checkpoint. + model_id_or_path = "openai/diffusers-cd_bedroom256_lpips" + pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) + pipe.to(device) -## Usage Example ++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -```python -import torch - -from diffusers import ConsistencyModelPipeline - -device = "cuda" -# Load the cd_imagenet64_l2 checkpoint. -model_id_or_path = "openai/diffusers-cd_imagenet64_l2" -pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Onestep Sampling -image = pipe(num_inference_steps=1).images[0] -image.save("consistency_model_onestep_sample.png") - -# Onestep sampling, class-conditional image generation -# ImageNet-64 class label 145 corresponds to king penguins -image = pipe(num_inference_steps=1, class_labels=145).images[0] -image.save("consistency_model_onestep_sample_penguin.png") - -# Multistep sampling, class-conditional image generation -# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo. -# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77 -image = pipe(timesteps=[22, 0], class_labels=145).images[0] -image.save("consistency_model_multistep_sample_penguin.png") -``` - -For an additional speed-up, one can also make use of `torch.compile`. Multiple images can be generated in <1 second as follows: - -```py -import torch -from diffusers import ConsistencyModelPipeline - -device = "cuda" -# Load the cd_bedroom256_lpips checkpoint. 
-model_id_or_path = "openai/diffusers-cd_bedroom256_lpips" -pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -# Multistep sampling -# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo: -# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83 -for _ in range(10): - image = pipe(timesteps=[17, 0]).images[0] - image.show() + # Multistep sampling + # Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo: + # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83 + for _ in range(10): + image = pipe(timesteps=[17, 0]).images[0] + image.show() ``` ## ConsistencyModelPipeline [[autodoc]] ConsistencyModelPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/controlnet.mdx b/docs/source/en/api/pipelines/controlnet.mdx index f9e4c3c47e3e..ab5ddc9b29a2 100644 --- a/docs/source/en/api/pipelines/controlnet.mdx +++ b/docs/source/en/api/pipelines/controlnet.mdx @@ -10,32 +10,19 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Text-to-Image Generation with ControlNet Conditioning +# ControlNet -## Overview +[Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. -[Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) by Lvmin Zhang and Maneesh Agrawala. +Using a pretrained model, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details. -Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details. - -The abstract of the paper is the following: +The abstract from the paper is: *We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.* -This model was contributed by the community contributor [takuma104](https://huggingface.co/takuma104) ❤️ . - -Resources: - -* [Paper](https://arxiv.org/abs/2302.05543) -* [Original Code](https://github.com/lllyasviel/ControlNet) - -## Available Pipelines: +This model was contributed by [takuma104](https://huggingface.co/takuma104). 
❤️ -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet.py) | *Text-to-Image Generation with ControlNet Conditioning* | [Colab Example](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb) -| [StableDiffusionControlNetImg2ImgPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py) | *Image-to-Image Generation with ControlNet Conditioning* | -| [StableDiffusionControlNetInpaintPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_controlnet_inpaint.py) | *Inpainting Generation with ControlNet Conditioning* | +The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet). ## Usage example diff --git a/docs/source/en/api/pipelines/cycle_diffusion.mdx b/docs/source/en/api/pipelines/cycle_diffusion.mdx index b8fbff5d7157..e0f74ae94845 100644 --- a/docs/source/en/api/pipelines/cycle_diffusion.mdx +++ b/docs/source/en/api/pipelines/cycle_diffusion.mdx @@ -12,89 +12,16 @@ specific language governing permissions and limitations under the License. # Cycle Diffusion -## Overview +Cycle Diffusion is a text guided image-to-image generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://huggingface.co/papers/2210.05559) by Chen Henry Wu, Fernando De la Torre. -Cycle Diffusion is a Text-Guided Image-to-Image Generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) by Chen Henry Wu, Fernando De la Torre. - -The abstract of the paper is the following: +The abstract from the paper is: *Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.* -*Tips*: -- The Cycle Diffusion pipeline is fully compatible with any [Stable Diffusion](./stable_diffusion) checkpoints -- Currently Cycle Diffusion only works with the [`DDIMScheduler`]. 
- -*Example*: - -In the following we should how to best use the [`CycleDiffusionPipeline`] - -```python -import requests -import torch -from PIL import Image -from io import BytesIO - -from diffusers import CycleDiffusionPipeline, DDIMScheduler - -# load the pipeline -# make sure you're logged in with `huggingface-cli login` -model_id_or_path = "CompVis/stable-diffusion-v1-4" -scheduler = DDIMScheduler.from_pretrained(model_id_or_path, subfolder="scheduler") -pipe = CycleDiffusionPipeline.from_pretrained(model_id_or_path, scheduler=scheduler).to("cuda") - -# let's download an initial image -url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/An%20astronaut%20riding%20a%20horse.png" -response = requests.get(url) -init_image = Image.open(BytesIO(response.content)).convert("RGB") -init_image = init_image.resize((512, 512)) -init_image.save("horse.png") - -# let's specify a prompt -source_prompt = "An astronaut riding a horse" -prompt = "An astronaut riding an elephant" - -# call the pipeline -image = pipe( - prompt=prompt, - source_prompt=source_prompt, - image=init_image, - num_inference_steps=100, - eta=0.1, - strength=0.8, - guidance_scale=2, - source_guidance_scale=1, -).images[0] - -image.save("horse_to_elephant.png") - -# let's try another example -# See more samples at the original repo: https://github.com/ChenWu98/cycle-diffusion -url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/A%20black%20colored%20car.png" -response = requests.get(url) -init_image = Image.open(BytesIO(response.content)).convert("RGB") -init_image = init_image.resize((512, 512)) -init_image.save("black.png") - -source_prompt = "A black colored car" -prompt = "A blue colored car" - -# call the pipeline -torch.manual_seed(0) -image = pipe( - prompt=prompt, - source_prompt=source_prompt, - image=init_image, - num_inference_steps=100, - eta=0.1, - strength=0.85, - guidance_scale=3, - source_guidance_scale=1, -).images[0] - -image.save("black_to_blue.png") -``` - ## CycleDiffusionPipeline [[autodoc]] CycleDiffusionPipeline - all - __call__ + +## StableDiffusionPiplineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/dance_diffusion.mdx b/docs/source/en/api/pipelines/dance_diffusion.mdx index 92b5b9f877bc..9d8ceb5b8868 100644 --- a/docs/source/en/api/pipelines/dance_diffusion.mdx +++ b/docs/source/en/api/pipelines/dance_diffusion.mdx @@ -12,23 +12,16 @@ specific language governing permissions and limitations under the License. # Dance Diffusion -## Overview +[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans. -[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) by Zach Evans. - -Dance Diffusion is the first in a suite of generative audio tools for producers and musicians to be released by Harmonai. -For more info or to get involved in the development of these tools, please visit https://harmonai.org and fill out the form on the front page. - -The original codebase of this implementation can be found [here](https://github.com/Harmonai-org/sample-generator). 
- -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_dance_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py) | *Unconditional Audio Generation* | - | +Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org). +The original codebase of this implementation can be found at [Harmonai-org](https://github.com/Harmonai-org/sample-generator). ## DanceDiffusionPipeline [[autodoc]] DanceDiffusionPipeline - all - __call__ + +## AudioPipelineOutput +[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/ddim.mdx b/docs/source/en/api/pipelines/ddim.mdx index 3adcb375b448..98da201545b9 100644 --- a/docs/source/en/api/pipelines/ddim.mdx +++ b/docs/source/en/api/pipelines/ddim.mdx @@ -12,25 +12,18 @@ specific language governing permissions and limitations under the License. # DDIM -## Overview +[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. -[Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. +The abstract from the paper is: -The abstract of the paper is the following: - -Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space. - -The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim). -For questions, feel free to contact the author on [tsong.me](https://tsong.me/). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim/pipeline_ddim.py) | *Unconditional Image Generation* | - | +*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. 
We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* +The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author at [tsong.me](https://tsong.me/). ## DDIMPipeline [[autodoc]] DDIMPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/ddpm.mdx b/docs/source/en/api/pipelines/ddpm.mdx index 1be71964041c..1f615106bcfe 100644 --- a/docs/source/en/api/pipelines/ddpm.mdx +++ b/docs/source/en/api/pipelines/ddpm.mdx @@ -12,26 +12,18 @@ specific language governing permissions and limitations under the License. # DDPM -## Overview +[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the 🤗 Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline. -[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) - (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes the diffusion based model of the same name, but in the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline. +The abstract from the paper is: -The abstract of the paper is the following: - -We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. - -The original codebase of this paper can be found [here](https://github.com/hojonathanho/diffusion). - - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm/pipeline_ddpm.py) | *Unconditional Image Generation* | - | +*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.* +The original codebase can be found at [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion).
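As a quick illustration of unconditional sampling with this pipeline, a minimal sketch (the `google/ddpm-cat-256` checkpoint and the CUDA device are assumptions here):

```python
from diffusers import DDPMPipeline

# Load a pretrained unconditional DDPM checkpoint (google/ddpm-cat-256 is an assumption)
pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")

# Run the full 1000-step reverse (denoising) process and save the generated sample
image = pipe(num_inference_steps=1000).images[0]
image.save("ddpm_generated_image.png")
```

Lowering `num_inference_steps` speeds up sampling at the expense of sample quality.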
# DDPMPipeline [[autodoc]] DDPMPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/diffedit.mdx b/docs/source/en/api/pipelines/diffedit.mdx index 8bb714971f15..986b9ec6a9c1 100644 --- a/docs/source/en/api/pipelines/diffedit.mdx +++ b/docs/source/en/api/pipelines/diffedit.mdx @@ -10,21 +10,15 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Zero-shot Diffusion-based Semantic Image Editing with Mask Guidance +# DiffEdit -## Overview +[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. -[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://arxiv.org/abs/2210.11427) by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. - -The abstract of the paper is the following: +The abstract from the paper is: *Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.* -Resources: - -* [Paper](https://arxiv.org/abs/2210.11427). -* [Blog Post with Demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html). -* [Implementation on Github](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/). +The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion/](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html). ## Tips @@ -51,14 +45,6 @@ below for more details. * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image. * Note that the source and target prompts, or their corresponding embeddings, can also be automatically generated. Please, refer to [this discussion](#generating-source-and-target-embeddings) for more details. 
-## Available Pipelines: - -| Pipeline | Tasks -|---|---| -| [StableDiffusionDiffEditPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py) | *Text-Based Image Editing* - - - ## Usage example ### Based on an input image with a caption @@ -357,4 +343,4 @@ images[0].save("edited_image.png") - all - generate_mask - invert - - __call__ + - __call__ \ No newline at end of file diff --git a/docs/source/en/api/pipelines/dit.mdx b/docs/source/en/api/pipelines/dit.mdx index ce96749a1720..26ca122b6c4e 100644 --- a/docs/source/en/api/pipelines/dit.mdx +++ b/docs/source/en/api/pipelines/dit.mdx @@ -10,50 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Scalable Diffusion Models with Transformers (DiT) +# DiT -## Overview +[Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie. -[Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748) (DiT) by William Peebles and Saining Xie. - -The abstract of the paper is the following: +The abstract from the paper is: *We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.* -The original codebase of this paper can be found here: [facebookresearch/dit](https://github.com/facebookresearch/dit). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_dit.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dit/pipeline_dit.py) | *Conditional Image Generation* | - | - - -## Usage example - -```python -from diffusers import DiTPipeline, DPMSolverMultistepScheduler -import torch - -pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16) -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -# pick words from Imagenet class labels -pipe.labels # to print all available words - -# pick words that exist in ImageNet -words = ["white shark", "umbrella"] - -class_ids = pipe.get_label_ids(words) - -generator = torch.manual_seed(33) -output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator) - -image = output.images[0] # label 'white shark' -``` +The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit). 
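For quick reference, a condensed sketch of the class-conditional sampling example removed above (it assumes the `facebook/DiT-XL-2-256` checkpoint and a CUDA device):

```python
import torch
from diffusers import DiTPipeline, DPMSolverMultistepScheduler

# Load the class-conditional DiT checkpoint and swap in a faster multistep scheduler
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Map ImageNet class names to label ids (pipe.labels lists all available names)
class_ids = pipe.get_label_ids(["white shark", "umbrella"])

generator = torch.manual_seed(33)
image = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator).images[0]
image.save("dit_white_shark.png")
```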
## DiTPipeline [[autodoc]] DiTPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/latent_diffusion.mdx b/docs/source/en/api/pipelines/latent_diffusion.mdx index 72c159e90d92..19d5e3a5f41b 100644 --- a/docs/source/en/api/pipelines/latent_diffusion.mdx +++ b/docs/source/en/api/pipelines/latent_diffusion.mdx @@ -12,31 +12,13 @@ specific language governing permissions and limitations under the License. # Latent Diffusion -## Overview +Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. -Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. - -The abstract of the paper is the following: +The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.* -The original codebase can be found [here](https://github.com/CompVis/latent-diffusion). - -## Tips: - -- -- -- - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_latent_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) | *Text-to-Image Generation* | - | -| [pipeline_latent_diffusion_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py) | *Super Resolution* | - | - -## Examples: - +The original codebase can be found at [Compvis/latent-diffusion](https://github.com/CompVis/latent-diffusion). 
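As a quick illustration, a minimal text-to-image sketch (the `CompVis/ldm-text2im-large-256` checkpoint and the CUDA device are assumptions here):

```python
from diffusers import DiffusionPipeline

# The text-to-image LDM checkpoint is an assumption; DiffusionPipeline resolves it
# to the LDMTextToImagePipeline documented below
pipe = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256").to("cuda")

prompt = "a painting of a squirrel eating a burger"
image = pipe(prompt, num_inference_steps=50, guidance_scale=6).images[0]
image.save("ldm_text2im_squirrel.png")
```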
## LDMTextToImagePipeline [[autodoc]] LDMTextToImagePipeline @@ -47,3 +29,6 @@ The original codebase can be found [here](https://github.com/CompVis/latent-diff [[autodoc]] LDMSuperResolutionPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/paint_by_example.mdx b/docs/source/en/api/pipelines/paint_by_example.mdx index 5abb3406db44..0b78f6803ea2 100644 --- a/docs/source/en/api/pipelines/paint_by_example.mdx +++ b/docs/source/en/api/pipelines/paint_by_example.mdx @@ -12,63 +12,18 @@ specific language governing permissions and limitations under the License. # PaintByExample -## Overview +[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. -[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. - -The abstract of the paper is the following: +The abstract from the paper is: *Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.* -The original codebase can be found [here](https://github.com/Fantasy-Studio/Paint-by-Example). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_paint_by_example.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py) | *Image-Guided Image Painting* | - | - -## Tips - -- PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. 
The checkpoint has been warm-started from the [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) and with the objective to inpaint partly masked images conditioned on example / reference images -- To quickly demo *PaintByExample*, please have a look at [this demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example) -- You can run the following code snippet as an example: - - -```python -# !pip install diffusers transformers - -import PIL -import requests -import torch -from io import BytesIO -from diffusers import DiffusionPipeline - - -def download_image(url): - response = requests.get(url) - return PIL.Image.open(BytesIO(response.content)).convert("RGB") - - -img_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/image/example_1.png" -mask_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/mask/example_1.png" -example_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/reference/example_1.jpg" - -init_image = download_image(img_url).resize((512, 512)) -mask_image = download_image(mask_url).resize((512, 512)) -example_image = download_image(example_url).resize((512, 512)) - -pipe = DiffusionPipeline.from_pretrained( - "Fantasy-Studio/Paint-by-Example", - torch_dtype=torch.float16, -) -pipe = pipe.to("cuda") - -image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0] -image -``` +The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example). ## PaintByExamplePipeline [[autodoc]] PaintByExamplePipeline - all - __call__ + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/panorama.mdx b/docs/source/en/api/pipelines/panorama.mdx index 75c27f129ad8..f9c797227a18 100644 --- a/docs/source/en/api/pipelines/panorama.mdx +++ b/docs/source/en/api/pipelines/panorama.mdx @@ -10,55 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation +# MultiDiffusion -## Overview +[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. -[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://arxiv.org/abs/2302.08113) by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. +The abstract from the paper is: -The abstract of the paper is the following: +*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. 
At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.* -*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. +You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [paper](https://arxiv.org/abs/2302.08113), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion). -Resources: +## Tips -* [Project Page](https://multidiffusion.github.io/). -* [Paper](https://arxiv.org/abs/2302.08113). -* [Original Code](https://github.com/omerbt/MultiDiffusion). -* [Demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion). - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionPanoramaPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py) | *Text-Guided Panorama View Generation* | [🤗 Space](https://huggingface.co/spaces/weizmannscience/MultiDiffusion)) | - - - -## Usage example - -```python -import torch -from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler - -model_ckpt = "stabilityai/stable-diffusion-2-base" -scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler") -pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16) - -pipe = pipe.to("cuda") - -prompt = "a photo of the dolomites" -image = pipe(prompt).images[0] -image.save("dolomites.png") -``` - - - -While calling this pipeline, it's possible to specify the `view_batch_size` to have a >1 value. -For some GPUs with high performance, higher a `view_batch_size`, can speedup the generation -and increase the VRAM usage. - - +While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. +For some GPUs with high performance, this can speedup the generation process and increase VRAM usage. 
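For example, the panorama example removed above could be adapted to use a larger `view_batch_size` (a sketch, assuming the `stabilityai/stable-diffusion-2-base` checkpoint and enough VRAM):

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# A higher view_batch_size denoises more panorama views per forward pass,
# trading extra VRAM for faster generation on capable GPUs
image = pipe("a photo of the dolomites", view_batch_size=8).images[0]
image.save("dolomites.png")
```

On GPUs with less memory, leaving `view_batch_size` at its default keeps peak VRAM usage lower.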
@@ -83,3 +48,6 @@ With circular padding, the right and the left parts are matching (`circular_padd [[autodoc]] StableDiffusionPanoramaPipeline - __call__ - all + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/paradigms.mdx b/docs/source/en/api/pipelines/paradigms.mdx index 938751c4874e..62504adb17aa 100644 --- a/docs/source/en/api/pipelines/paradigms.mdx +++ b/docs/source/en/api/pipelines/paradigms.mdx @@ -10,74 +10,39 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Parallel Sampling of Diffusion Models (ParaDiGMS) +# Parallel Sampling of Diffusion Models -## Overview +[Parallel Sampling of Diffusion Models](https://huggingface.co/papers/2305.16317) is by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari. -[Parallel Sampling of Diffusion Models](https://arxiv.org/abs/2305.16317) by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari. - -The abstract of the paper is the following: +The abstract from the paper is: *Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.* -Resources: - -* [Paper](https://arxiv.org/abs/2305.16317). -* [Original Code](https://github.com/AndyShih12/paradigms). - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionParadigmsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py) | *Faster Text-to-Image Generation* | | - -This pipeline was contributed by [`AndyShih12`](https://github.com/AndyShih12) in this [PR](https://github.com/huggingface/diffusers/pull/3716/). - -## Usage example +The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). 
❤️ -```python -import torch -from diffusers import DDPMParallelScheduler -from diffusers import StableDiffusionParadigmsPipeline +## Tips -scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler") - -pipe = StableDiffusionParadigmsPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16 -) -pipe = pipe.to("cuda") - -ngpu, batch_per_device = torch.cuda.device_count(), 5 -pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=[d for d in range(ngpu)]) - -prompt = "a photo of an astronaut riding a horse on mars" -image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0] -``` - - This pipeline improves sampling speed by running denoising steps in parallel, at the cost of increased total FLOPs. Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth sampling may be even slower than sequential sampling. The two parameters to play with are `parallel` (batch size) and `tolerance`. -- If it fits in memory, for 1000-step DDPM you can aim for a batch size of around 100 -(e.g. 8 GPUs and batch_per_device=12 to get parallel=96). Higher batch size +- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 +(for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size may not fit in memory, and lower batch size gives less parallelism. - For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. -If there is quality degradation with the default tolerance, then use a lower tolerance (e.g. 0.001). - -For 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup by StableDiffusionParadigmsPipeline instead of StableDiffusionPipeline -by setting parallel=80 and tolerance=0.1. - +If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`. - -Diffusers also offers distributed inference support for generating multiple prompts -in parallel on multiple GPUs. Check out the docs [here](https://huggingface.co/docs/diffusers/main/en/training/distributed_inference). +For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`] +by setting `parallel=80` and `tolerance=0.1`. -In contrast, this pipeline is designed for speeding up sampling of a single prompt (by using multiple GPUs). - +🤗 Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts +in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs. ## StableDiffusionParadigmsPipeline [[autodoc]] StableDiffusionParadigmsPipeline - __call__ - all + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/pix2pix.mdx b/docs/source/en/api/pipelines/pix2pix.mdx index d01f1df23385..d825ab4a6ed8 100644 --- a/docs/source/en/api/pipelines/pix2pix.mdx +++ b/docs/source/en/api/pipelines/pix2pix.mdx @@ -10,59 +10,15 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. 
--> -# InstructPix2Pix: Learning to Follow Image Editing Instructions +# InstructPix2Pix -## Overview +[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros. -[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) by Tim Brooks, Aleksander Holynski and Alexei A. Efros. - -The abstract of the paper is the following: +The abstract from the paper is: *We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.* -Resources: - -* [Project Page](https://www.timothybrooks.com/instruct-pix2pix). -* [Paper](https://arxiv.org/abs/2211.09800). -* [Original Code](https://github.com/timothybrooks/instruct-pix2pix). -* [Demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). - - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionInstructPix2PixPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py) | *Text-Based Image Editing* | [🤗 Space](https://huggingface.co/spaces/timbrooks/instruct-pix2pix) | - - - -## Usage example - -```python -import PIL -import requests -import torch -from diffusers import StableDiffusionInstructPix2PixPipeline - -model_id = "timbrooks/instruct-pix2pix" -pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") - -url = "https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" - - -def download_image(url): - image = PIL.Image.open(requests.get(url, stream=True).raw) - image = PIL.ImageOps.exif_transpose(image) - image = image.convert("RGB") - return image - - -image = download_image(url) - -prompt = "make the mountains snowy" -images = pipe(prompt, image=image, num_inference_steps=20, image_guidance_scale=1.5, guidance_scale=7).images -images[0].save("snowy_mountains.png") -``` +You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [paper](https://huggingface.co/papers/2211.09800), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). 
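Below is a minimal sketch of how the pipeline can be called (it assumes the [timbrooks/instruct-pix2pix](https://huggingface.co/timbrooks/instruct-pix2pix) checkpoint and a CUDA device):

```py
import PIL
import requests
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

# load the pipeline in half precision and move it to the GPU
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# download an image to edit
url = "https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"
image = PIL.Image.open(requests.get(url, stream=True).raw).convert("RGB")

# the written instruction drives the edit; image_guidance_scale controls how closely
# the output should follow the input image
prompt = "make the mountains snowy"
edited_image = pipe(
    prompt, image=image, num_inference_steps=20, image_guidance_scale=1.5, guidance_scale=7
).images[0]
edited_image.save("snowy_mountains.png")
```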
## StableDiffusionInstructPix2PixPipeline [[autodoc]] StableDiffusionInstructPix2PixPipeline @@ -71,3 +27,6 @@ images[0].save("snowy_mountains.png") - load_textual_inversion - load_lora_weights - save_lora_weights + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/pix2pix_zero.mdx b/docs/source/en/api/pipelines/pix2pix_zero.mdx index f04a54f242ac..2502d4d57209 100644 --- a/docs/source/en/api/pipelines/pix2pix_zero.mdx +++ b/docs/source/en/api/pipelines/pix2pix_zero.mdx @@ -10,22 +10,15 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Zero-shot Image-to-Image Translation +# Pix2Pix Zero -## Overview +[Zero-shot Image-to-Image Translation](https://huggingface.co/papers/2302.03027) is by Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. -[Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027). - -The abstract of the paper is the following: +The abstract from the paper is: *Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.* -Resources: - -* [Project Page](https://pix2pixzero.github.io/). -* [Paper](https://arxiv.org/abs/2302.03027). -* [Original Code](https://github.com/pix2pixzero/pix2pix-zero). -* [Demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo). +You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/), [paper](https://arxiv.org/abs/2302.03027), [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo). ## Tips diff --git a/docs/source/en/api/pipelines/pndm.mdx b/docs/source/en/api/pipelines/pndm.mdx index 43625fdfbe52..f4f6bd311278 100644 --- a/docs/source/en/api/pipelines/pndm.mdx +++ b/docs/source/en/api/pipelines/pndm.mdx @@ -12,24 +12,18 @@ specific language governing permissions and limitations under the License. 
# PNDM -## Overview +[Pseudo Numerical methods for Diffusion Models on manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao. -[Pseudo Numerical methods for Diffusion Models on manifolds](https://arxiv.org/abs/2202.09778) (PNDM) by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao. +The abstract from the paper is: -The abstract of the paper is the following: - -Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. - -The original codebase can be found [here](https://github.com/luping-liu/PNDM). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm/pipeline_pndm.py) | *Unconditional Image Generation* | - | +*Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. 
According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules.* +The original codebase can be found at [luping-liu/PNDM](https://github.com/luping-liu/PNDM). ## PNDMPipeline [[autodoc]] PNDMPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/repaint.mdx b/docs/source/en/api/pipelines/repaint.mdx index 895d3011883c..72b4a32e116c 100644 --- a/docs/source/en/api/pipelines/repaint.mdx +++ b/docs/source/en/api/pipelines/repaint.mdx @@ -12,66 +12,19 @@ specific language governing permissions and limitations under the License. # RePaint -## Overview +[RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) is by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool. -[RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2201.09865) (PNDM) by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool. +The abstract from the paper is: -The abstract of the paper is the following: +*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. +RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions.* -Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. 
Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. -RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. - -The original codebase can be found [here](https://github.com/andreas128/RePaint). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|-------------------------------------------------------------------------------------------------------------------------------|--------------------|:---:| -| [pipeline_repaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/repaint/pipeline_repaint.py) | *Image Inpainting* | - | - -## Usage example - -```python -from io import BytesIO - -import torch - -import PIL -import requests -from diffusers import RePaintPipeline, RePaintScheduler - - -def download_image(url): - response = requests.get(url) - return PIL.Image.open(BytesIO(response.content)).convert("RGB") - - -img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png" -mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png" - -# Load the original image and the mask as PIL images -original_image = download_image(img_url).resize((256, 256)) -mask_image = download_image(mask_url).resize((256, 256)) - -# Load the RePaint scheduler and pipeline based on a pretrained DDPM model -scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256") -pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler) -pipe = pipe.to("cuda") - -generator = torch.Generator(device="cuda").manual_seed(0) -output = pipe( - image=original_image, - mask_image=mask_image, - num_inference_steps=250, - eta=0.0, - jump_length=10, - jump_n_sample=10, - generator=generator, -) -inpainted_image = output.images[0] -``` +The original codebase can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint). ## RePaintPipeline [[autodoc]] RePaintPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/score_sde_ve.mdx b/docs/source/en/api/pipelines/score_sde_ve.mdx index 42253e301f4e..29332b1b663c 100644 --- a/docs/source/en/api/pipelines/score_sde_ve.mdx +++ b/docs/source/en/api/pipelines/score_sde_ve.mdx @@ -12,25 +12,18 @@ specific language governing permissions and limitations under the License. # Score SDE VE -## Overview +[Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) (Score SDE) is by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole. This pipeline implements the variance expanding (VE) variant of the stochastic differential equation method. -[Score-Based Generative Modeling through Stochastic Differential Equations](https://arxiv.org/abs/2011.13456) (Score SDE) by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole. +The abstract from the paper is: -The abstract of the paper is the following: +*Creating noise from data is easy; creating data from noise is generative modeling. 
We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* -Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model. - -The original codebase can be found [here](https://github.com/yang-song/score_sde_pytorch). 
- -This pipeline implements the Variance Expanding (VE) variant of the method. - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_score_sde_ve.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py) | *Unconditional Image Generation* | - | +The original codebase can be found at [yang-song/score_sde_pytorch](https://github.com/yang-song/score_sde_pytorch). ## ScoreSdeVePipeline [[autodoc]] ScoreSdeVePipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx index 7493005b0a1f..768d13b5df6e 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx @@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License. The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. -The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) and additional official checkpoints for depth-to-image can be found [here](https://huggingface.co/stabilityai/stable-diffusion-2-depth). +The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) and additional official checkpoints for depth-to-image can be found at [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth). ## StableDiffusionDepth2ImgPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx index c45c5d9c80a5..b621e015d93f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx @@ -14,6 +14,8 @@ specific language governing permissions and limitations under the License. The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. +You can find the original codebases for the inpainting models in the following repositories: + | Stable Diffusion version | Repository | |--------------------------|------------------------------------------------------------------------------------------------------------------------| | v1 | [CompVis/stable-diffusion](https://github.com/runwayml/stable-diffusion#inpainting-with-stable-diffusion) | diff --git a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx index 2da653ffa141..6c9b2a9028ec 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx @@ -18,19 +18,6 @@ The abstract from the paper is: *This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. 
The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* -Running LDM3D is straighforward with the [`StableDiffusionLDM3DPipeline`]: - -```python ->>> from diffusers import StableDiffusionLDM3DPipeline - ->>> pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d") -prompt ="A picture of some lemons on a table" -output = pipe(prompt) -rgb_image, depth_image = output.rgb, output.depth -rgb_image[0].save("lemons_ldm3d_rgb.jpg") -depth_image[0].save("lemons_ldm3d_depth.png") -``` - ## StableDiffusionLDM3DPipeline [[autodoc]] StableDiffusionLDM3DPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/overview.mdx b/docs/source/en/api/pipelines/stable_diffusion/overview.mdx index 5f1a6a4aad5d..02f42c307d6f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/overview.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/overview.mdx @@ -10,40 +10,32 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Stable diffusion pipelines +# Stable Diffusion pipelines -Stable Diffusion is a text-to-image _latent diffusion_ model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It's trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. +Stable Diffusion is a text-to-image _latent diffusion_ model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. -Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. You can learn more details about it in the [specific pipeline for latent diffusion](pipelines/latent_diffusion) that is part of 🤗 Diffusers. +Stable Diffusion is trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset. 
This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. -For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, please refer to the official [launch announcement post](https://stability.ai/blog/stable-diffusion-announcement) and [this section of our own blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work). +For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and [our own blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. -*Tips*: -- To tweak your prompts on a specific result you liked, you can generate your own latents, as demonstrated in the following notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb) +You can find the original codebases for the different Stable Diffusion versions in the following repositories: -*Overview*: +| Stable Diffusion version | Repository | +|--------------------------|---------------------------------------------------------------------------------| +| v1 | [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) | +| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) | -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [StableDiffusionPipeline](./text2img) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb) | [🤗 Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion) -| [StableDiffusionPipelineSafe](./stable_diffusion_safe) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) | [![Huggingface Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion) -| [StableDiffusionImg2ImgPipeline](./img2img) | *Image-to-Image Text-Guided Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) | [🤗 Diffuse the Rest](https://huggingface.co/spaces/huggingface/diffuse-the-rest) -| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** – *Text-Guided Image Inpainting* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) | -| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** – *Depth-to-Image Text-Guided Generation* | | -| [StableDiffusionImageVariationPipeline](./image_variation) | **Experimental** – *Image Variation Generation* | | [🤗 Stable Diffusion Image 
Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations) -| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** – *Text-Guided Image Super-Resolution* | | -| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** – *Text-Guided Image Super-Resolution* | | -| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Image Inpainting* | -| [Stable Diffusion 2](./stable_diffusion_2) | *Depth-to-Image Text-Guided Generation* | -| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Super Resolution Image-to-Image* | -| [StableDiffusionLDM3DPipeline](./ldm3d) | *Text-to-(RGB, Depth)* | +Additional official checkpoints for different versions of the Stable Diffusion model for different tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. Feel free to explore these organizations to find the best pipeline for your use-case! ## Tips -### How to load and use different schedulers. +[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible with the [`StableDiffusionPipeline`]. To try out a different scheduler: -The stable diffusion pipeline uses [`PNDMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the stable diffusion pipeline such as [`DDIMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc. -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following: + + +Check out the [Schedulers](../using-diffusers/schedulers) guide for more details about how to change and compare different schedulers. + + ```python >>> from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler @@ -56,12 +48,13 @@ To use a different scheduler, you can either change it via the [`ConfigMixin.fro >>> pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler) ``` +To save memory and use the same components across multiple pipelines, use the `.components` method: + + -### How to convert all use cases with multiple or single pipeline +Read the reuse components across pipelines [section](../using-diffusers/loading#reuse-components-across-pipelines) for more details. -If you want to use all possible use cases in a single `DiffusionPipeline` you can either: -- Make use of the [Stable Diffusion Mega Pipeline](https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega) or -- Make use of the `components` functionality to instantiate all components in the most memory-efficient way: + ```python >>> from diffusers import ( @@ -75,7 +68,4 @@ If you want to use all possible use cases in a single `DiffusionPipeline` you ca >>> inpaint = StableDiffusionInpaintPipeline(**text2img.components) >>> # now you can use text2img(...), img2img(...), inpaint(...) 
just like the call methods of each respective pipeline -``` - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput +``` \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx index 48d3ec0680d7..51fe19eeebc6 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx @@ -17,7 +17,7 @@ Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the wo *The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).* -For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [launch announcement post](https://stability.ai/blog/stable-diffusion-v2-release). +For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release). diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 1191be418315..37e7e30c4857 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -18,14 +18,6 @@ The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. 
Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .* - -| Stable Diffusion version | Repository | -|--------------------------|---------------------------------------------------------------------------------| -| v1 | [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) | -| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) | - -Additional official checkpoints for different versions of the Stable Diffusion model can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. - ## StableDiffusionPipeline [[autodoc]] StableDiffusionPipeline diff --git a/src/diffusers/pipelines/audio_diffusion/mel.py b/src/diffusers/pipelines/audio_diffusion/mel.py index 1bf28fd25a5a..4bf19ee13215 100644 --- a/src/diffusers/pipelines/audio_diffusion/mel.py +++ b/src/diffusers/pipelines/audio_diffusion/mel.py @@ -37,13 +37,20 @@ class Mel(ConfigMixin, SchedulerMixin): """ Parameters: - x_res (`int`): x resolution of spectrogram (time) - y_res (`int`): y resolution of spectrogram (frequency bins) - sample_rate (`int`): sample rate of audio - n_fft (`int`): number of Fast Fourier Transforms - hop_length (`int`): hop length (a higher number is recommended for lower than 256 y_res) - top_db (`int`): loudest in decibels - n_iter (`int`): number of iterations for Griffin Linn mel inversion + x_res (`int`): + x resolution of spectrogram (time). + y_res (`int`): + y resolution of spectrogram (frequency bins). + sample_rate (`int`): + Sample rate of audio. + n_fft (`int`): + Number of Fast Fourier Transforms. + hop_length (`int`): + Hop length (a higher number is recommended if `y_res` < 256). + top_db (`int`): + Loudest decibel value. + n_iter (`int`): + Number of iterations for Griffin-Lim Mel inversion. """ config_name = "mel_config.json" @@ -74,8 +81,10 @@ def set_resolution(self, x_res: int, y_res: int): """Set resolution. Args: - x_res (`int`): x resolution of spectrogram (time) - y_res (`int`): y resolution of spectrogram (frequency bins) + x_res (`int`): + x resolution of spectrogram (time). + y_res (`int`): + y resolution of spectrogram (frequency bins). """ self.x_res = x_res self.y_res = y_res @@ -86,8 +95,10 @@ def load_audio(self, audio_file: str = None, raw_audio: np.ndarray = None): """Load audio. Args: - audio_file (`str`): must be a file on disk due to Librosa limitation or - raw_audio (`np.ndarray`): audio as numpy array + audio_file (`str`): + An audio file that must be on disk due to [Librosa](https://librosa.org/) limitation. + raw_audio (`np.ndarray`): + The raw audio file as a NumPy array. """ if audio_file is not None: self.audio, _ = librosa.load(audio_file, mono=True, sr=self.sr) @@ -102,7 +113,8 @@ def get_number_of_slices(self) -> int: """Get number of slices in audio. Returns: - `int`: number of spectograms audio can be sliced into + `int`: + Number of spectograms audio can be sliced into. """ return len(self.audio) // self.slice_size @@ -110,10 +122,12 @@ def get_audio_slice(self, slice: int = 0) -> np.ndarray: """Get slice of audio. 
Args: - slice (`int`): slice number of audio (out of get_number_of_slices()) + slice (`int`): + Slice number of audio (out of `get_number_of_slices()`). Returns: - `np.ndarray`: audio as numpy array + `np.ndarray`: + The audio slice as a NumPy array """ return self.audio[self.slice_size * slice : self.slice_size * (slice + 1)] @@ -121,7 +135,8 @@ def get_sample_rate(self) -> int: """Get sample rate: Returns: - `int`: sample rate of audio + `int`: + Sample rate of audio. """ return self.sr @@ -129,10 +144,12 @@ def audio_slice_to_image(self, slice: int) -> Image.Image: """Convert slice of audio to spectrogram. Args: - slice (`int`): slice number of audio to convert (out of get_number_of_slices()) + slice (`int`): + Slice number of audio to convert (out of `get_number_of_slices()`). Returns: - `PIL Image`: grayscale image of x_res x y_res + `PIL Image`: + A grayscale image of `x_res x y_res`. """ S = librosa.feature.melspectrogram( y=self.get_audio_slice(slice), sr=self.sr, n_fft=self.n_fft, hop_length=self.hop_length, n_mels=self.n_mels @@ -146,10 +163,12 @@ def image_to_audio(self, image: Image.Image) -> np.ndarray: """Converts spectrogram to audio. Args: - image (`PIL Image`): x_res x y_res grayscale image + image (`PIL Image`): + An grayscale image of `x_res x y_res`. Returns: - audio (`np.ndarray`): raw audio + audio (`np.ndarray`): + The audio as a NumPy array. """ bytedata = np.frombuffer(image.tobytes(), dtype="uint8").reshape((image.height, image.width)) log_S = bytedata.astype("float") * self.top_db / 255 - self.top_db diff --git a/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py b/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py index 629a2e7d32ca..107e02a34ecb 100644 --- a/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py +++ b/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py @@ -29,14 +29,21 @@ class AudioDiffusionPipeline(DiffusionPipeline): """ - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for audio diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: - vqae ([`AutoencoderKL`]): Variational AutoEncoder for Latent Audio Diffusion or None - unet ([`UNet2DConditionModel`]): UNET model - mel ([`Mel`]): transform audio <-> spectrogram - scheduler ([`DDIMScheduler` or `DDPMScheduler`]): de-noising scheduler + vqae ([`AutoencoderKL`]): + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. + mel ([`Mel`]): + Transform audio into a spectrogram. + scheduler ([`DDIMScheduler` or `DDPMScheduler`]): + A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of + [`DDIMScheduler` or `DDPMScheduler`]. """ _optional_components = ["vqvae"] @@ -80,26 +87,90 @@ def __call__( Union[AudioPipelineOutput, ImagePipelineOutput], Tuple[List[Image.Image], Tuple[int, List[np.ndarray]]], ]: - """Generate random mel spectrogram from audio input and convert to audio. + """ + The call function to the pipeline for generation. 
Args: - batch_size (`int`): number of samples to generate - audio_file (`str`): must be a file on disk due to Librosa limitation or - raw_audio (`np.ndarray`): audio as numpy array - slice (`int`): slice number of audio to convert - start_step (int): step to start from - steps (`int`): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM) - generator (`torch.Generator`): random number generator or None - mask_start_secs (`float`): number of seconds of audio to mask (not generate) at start - mask_end_secs (`float`): number of seconds of audio to mask (not generate) at end - step_generator (`torch.Generator`): random number generator used to de-noise or None - eta (`float`): parameter between 0 and 1 used with DDIM scheduler - noise (`torch.Tensor`): noise tensor of shape (batch_size, 1, height, width) or None - encoding (`torch.Tensor`): for UNet2DConditionModel shape (batch_size, seq_length, cross_attention_dim) - return_dict (`bool`): if True return AudioPipelineOutput, ImagePipelineOutput else Tuple + batch_size (`int`): + Number of samples to generate. + audio_file (`str`): + An audio file that must be on disk due to [Librosa](https://librosa.org/) limitation. + raw_audio (`np.ndarray`): + The raw audio file as a NumPy array. + slice (`int`): + Slice number of audio to convert. + start_step (int): + Step to start diffusion from. + steps (`int`): + Number of denoising steps (defaults to `50` for DDIM and `1000` for DDPM). + generator (`torch.Generator`): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + mask_start_secs (`float`): + Number of seconds of audio to mask (not generate) at start. + mask_end_secs (`float`): + Number of seconds of audio to mask (not generate) at end. + step_generator (`torch.Generator`): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) used to denoise. + None + eta (`float`): + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + noise (`torch.Tensor`): + A noise tensor of shape `(batch_size, 1, height, width)` or `None`. + encoding (`torch.Tensor`): + for UNet2DConditionModel shape (batch_size, seq_length, cross_attention_dim) + return_dict (`bool`): + Whether or not to return a [`AudioPipelineOutput`], [`ImagePipelineOutput`] instead of a plain tuple. 
+ + + Examples: + + For audio diffusion: + + ```py + import torch + from IPython.display import Audio + from diffusers import DiffusionPipeline + + device = "cuda" if torch.cuda.is_available() else "cpu" + pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device) + + output = pipe() + display(output.images[0]) + display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) + ``` + + For latent audio diffusion: + + ```py + import torch + from IPython.display import Audio + from diffusers import DiffusionPipeline + + device = "cuda" if torch.cuda.is_available() else "cpu" + pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device) + + output = pipe() + display(output.images[0]) + display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) + ``` + + For other tasks like variation, inpainting, outpainting, etc.: + + ```py + output = pipe( + raw_audio=output.audios[0, 0], + start_step=int(pipe.get_default_steps() / 2), + mask_start_secs=1, + mask_end_secs=1, + ) + display(output.images[0]) + display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate())) + ``` Returns: - `List[PIL Image]`: mel spectrograms (`float`, `List[np.ndarray]`): sample rate and raw audios + `List[PIL Image]`: + A list of Mel spectrograms (`float`, `List[np.ndarray]`) with the sample rate and raw audio. """ steps = steps or self.get_default_steps() @@ -197,14 +268,18 @@ def __call__( @torch.no_grad() def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray: - """Reverse step process: recover noisy image from generated image. + """ + Reverse the denoising step process to recover a noisy image from the generated image. Args: - images (`List[PIL Image]`): list of images to encode - steps (`int`): number of encoding steps to perform (defaults to 50) + images (`List[PIL Image]`): + List of images to encode. + steps (`int`): + Number of encoding steps to perform (defaults to `50`). Returns: - `np.ndarray`: noise tensor of shape (batch_size, 1, height, width) + `np.ndarray`: + A noise tensor of shape `(batch_size, 1, height, width)`. """ # Only works with DDIM as this method is deterministic @@ -237,12 +312,16 @@ def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor: """Spherical Linear intERPolation Args: - x0 (`torch.Tensor`): first tensor to interpolate between - x1 (`torch.Tensor`): seconds tensor to interpolate between - alpha (`float`): interpolation between 0 and 1 + x0 (`torch.Tensor`): + The first tensor to interpolate between. + x1 (`torch.Tensor`): + The second tensor to interpolate between. + alpha (`float`): + Interpolation between 0 and 1. Returns: - `torch.Tensor`: interpolated tensor + `torch.Tensor`: + The interpolated tensor.
""" theta = acos(torch.dot(torch.flatten(x0), torch.flatten(x1)) / torch.norm(x0) / torch.norm(x1)) diff --git a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py index fe204afa7436..1286ef5c14f0 100644 --- a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py +++ b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py @@ -31,14 +31,19 @@ EXAMPLE_DOC_STRING = """ Examples: ```py - >>> import torch >>> from diffusers import AudioLDMPipeline + >>> import torch + >>> import scipy - >>> pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16) + >>> repo_id = "cvssp/audioldm-s-full-v2" + >>> pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16) >>> pipe = pipe.to("cuda") - >>> prompt = "A hammer hitting a wooden surface" - >>> audio = pipe(prompt).audios[0] + >>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs" + >>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0] + + >>> # save the audio sample as a .wav file + >>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio) ``` """ @@ -47,26 +52,24 @@ class AudioLDMPipeline(DiffusionPipeline): r""" Pipeline for text-to-audio generation using AudioLDM. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode audios to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode audio to and from latent representations. text_encoder ([`ClapTextModelWithProjection`]): - Frozen text-encoder. AudioLDM uses the text portion of - [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap#transformers.ClapTextModelWithProjection), - specifically the [RoBERTa HSTAT-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant. + Frozen text-encoder ([`~transformers.ClapTextModelWithProjection`], specifically the + [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant). tokenizer ([`PreTrainedTokenizer`]): - Tokenizer of class - [RobertaTokenizer](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer). - unet ([`UNet2DConditionModel`]): U-Net architecture to denoise the encoded audio latents. + A [`~transformers.RobertaTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded audio latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. vocoder ([`SpeechT5HifiGan`]): - Vocoder of class - [SpeechT5HifiGan](https://huggingface.co/docs/transformers/main/en/model_doc/speecht5#transformers.SpeechT5HifiGan). + Vocoder of class [`~transformers.SpeechT5HifiGan`]. """ def __init__( @@ -380,70 +383,62 @@ def __call__( output_type: Optional[str] = "np", ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation.
Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the audio generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide audio generation. If not defined, you need to pass `prompt_embeds`. audio_length_in_s (`int`, *optional*, defaults to 5.12): The length of the generated audio sample in seconds. num_inference_steps (`int`, *optional*, defaults to 10): The number of denoising steps. More denoising steps usually lead to a higher quality audio at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 2.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate audios that are closely linked to the text `prompt`, - usually at the expense of lower sound quality. + A higher guidance scale value encourages the model to generate audio that is closely linked to the text + `prompt` at the expense of lower sound quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the audio generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in audio generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_waveforms_per_prompt (`int`, *optional*, defaults to 1): The number of waveforms to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings.
Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttnProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). output_type (`str`, *optional*, defaults to `"np"`): - The output format of the generate image. Choose between: - - `"np"`: Return Numpy `np.ndarray` objects. - - `"pt"`: Return PyTorch `torch.Tensor` objects. + The output format of the generated image. Choose between `"np"` to return a NumPy `np.ndarray` or + `"pt"` to return a PyTorch `torch.Tensor` object. Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated audios. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated audio. """ # 0. Convert audio input length from seconds to spectrogram height vocoder_upsample_factor = np.prod(self.vocoder.config.upsample_rates) / self.vocoder.config.sampling_rate diff --git a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py index ec4af7afe5ad..55772a83ca78 100644 --- a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py +++ b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py @@ -50,20 +50,17 @@ class ConsistencyModelPipeline(DiffusionPipeline): r""" - Pipeline for consistency models for unconditional or class-conditional image generation, as introduced in [1]. + Pipeline for unconditional or class-conditional image generation. - This model inherits from [`DiffusionPipeline`]. 
Check the superclass documentation for the generic methods the
-    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
-
-    [1] Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya. "Consistency Models"
-    https://arxiv.org/pdf/2303.01469
+    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
+    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        unet ([`UNet2DModel`]):
-            Unconditional or class-conditional U-Net architecture to denoise image latents.
+            A [`UNet2DModel`] to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the image latents. Currently only compatible
-            with [`CMStochasticIterativeScheduler`].
+            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Currently only
+            compatible with [`CMStochasticIterativeScheduler`].
    """

    def __init__(self, unet: UNet2DModel, scheduler: CMStochasticIterativeScheduler) -> None:
@@ -78,10 +75,10 @@ def __init__(self, unet: UNet2DModel, scheduler: CMStochasticIterativeScheduler)

    def enable_model_cpu_offload(self, gpu_id=0):
        r"""
-        Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared
-        to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward`
-        method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with
-        `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`.
+        Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
+        time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
+        Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
+        iterative execution of the `unet`.
        """
        if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
            from accelerate import cpu_offload_with_hook
@@ -201,8 +198,8 @@ def __call__(
            batch_size (`int`, *optional*, defaults to 1):
                The number of images to generate.
            class_labels (`torch.Tensor` or `List[int]` or `int`, *optional*):
-                Optional class labels for conditioning class-conditional consistency models. Will not be used if the
-                model is not class-conditional.
+                Optional class labels for conditioning class-conditional consistency models. Not used if the model is
+                not class-conditional.
            num_inference_steps (`int`, *optional*, defaults to 1):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
@@ -210,29 +207,29 @@
                Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
                timesteps are used. Must be in descending order.
            generator (`torch.Generator`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
+                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+                generation deterministic.
latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated images. """ # 0. Prepare call parameters img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py index c3eb32273b6d..d4f3887b6035 100644 --- a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py +++ b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py @@ -26,13 +26,16 @@ class DanceDiffusionPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for audio generation. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: - unet ([`UNet1DModel`]): U-Net architecture to denoise the encoded audio. + unet ([`UNet1DModel`]): + A [`UNet1DModel`] to denoise the encoded audio. scheduler ([`SchedulerMixin`]): - A scheduler to be used in combination with `unet` to denoise the encoded audio. Can be one of + A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of [`IPNDMScheduler`]. 
""" @@ -50,6 +53,8 @@ def __call__( return_dict: bool = True, ) -> Union[AudioPipelineOutput, Tuple]: r""" + The call function to the pipeline for generation. + Args: batch_size (`int`, *optional*, defaults to 1): The number of audio samples to generate. @@ -57,17 +62,41 @@ def __call__( The number of denoising steps. More denoising steps usually lead to a higher-quality audio sample at the expense of slower inference. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. audio_length_in_s (`float`, *optional*, defaults to `self.unet.config.sample_size/self.unet.config.sample_rate`): - The length of the generated audio sample in seconds. Note that the output of the pipeline, *i.e.* - `sample_size`, will be `audio_length_in_s` * `self.unet.config.sample_rate`. + The length of the generated audio sample in seconds. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.AudioPipelineOutput`] instead of a plain tuple. + Example: + + ```py + # !pip install diffusers[torch] accelerate scipy + from diffusers import DiffusionPipeline + from scipy.io.wavfile import write + + model_id = "harmonai/maestro-150k" + pipe = DiffusionPipeline.from_pretrained(model_id) + pipe = pipe.to("cuda") + + audios = pipe(audio_length_in_s=4.0).audios + + # To save locally + for i, audio in enumerate(audios): + write(f"maestro_test_{i}.wav", pipe.unet.sample_rate, audio.transpose()) + + # To dislay in google colab + import IPython.display as ipd + + for audio in audios: + display(ipd.Audio(audio, rate=pipe.unet.sample_rate)) + ``` + Returns: - [`~pipelines.AudioPipelineOutput`] or `tuple`: [`~pipelines.utils.AudioPipelineOutput`] if `return_dict` is - True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated audio. + [`~pipelines.AudioPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.AudioPipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated audio. """ if audio_length_in_s is None: diff --git a/src/diffusers/pipelines/ddim/pipeline_ddim.py b/src/diffusers/pipelines/ddim/pipeline_ddim.py index c24aa6c79793..06dbf1d525c5 100644 --- a/src/diffusers/pipelines/ddim/pipeline_ddim.py +++ b/src/diffusers/pipelines/ddim/pipeline_ddim.py @@ -23,11 +23,14 @@ class DDIMPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for image generation. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: - unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of [`DDPMScheduler`], or [`DDIMScheduler`]. 
@@ -53,29 +56,57 @@ def __call__( return_dict: bool = True, ) -> Union[ImagePipelineOutput, Tuple]: r""" + The call function to the pipeline for generation. + Args: batch_size (`int`, *optional*, defaults to 1): The number of images to generate. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. eta (`float`, *optional*, defaults to 0.0): - The eta parameter which controls the scale of the variance (0 is DDIM and 1 is one type of DDPM). + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. A value of `0` corresponds to + DDIM and `1` corresponds to DDPM. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. use_clipped_model_output (`bool`, *optional*, defaults to `None`): - if `True` or `False`, see documentation for `DDIMScheduler.step`. If `None`, nothing is passed - downstream to the scheduler. So use `None` for schedulers which don't support this argument. + If `True` or `False`, see documentation for [`DDIMScheduler.step`]. If `None`, nothing is passed + downstream to the scheduler (use `None` for schedulers which don't support this argument). output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Example: + + ```py + >>> # !pip install diffusers + >>> from diffusers import DDIMPipeline + >>> import PIL.Image + >>> import numpy as np + + >>> # load model and scheduler + >>> pipe = DDIMPipeline.from_pretrained("fusing/ddim-lsun-bedroom") + + >>> # run pipeline in inference (sample random noise and denoise) + >>> image = pipe(eta=0.0, num_inference_steps=50) + + >>> # process image to PIL + >>> image_processed = image.cpu().permute(0, 2, 3, 1) + >>> image_processed = (image_processed + 1.0) * 127.5 + >>> image_processed = image_processed.numpy().astype(np.uint8) + >>> image_pil = PIL.Image.fromarray(image_processed[0]) + + >>> # save image + >>> image_pil.save("test.png") + ``` + Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images """ # Sample gaussian noise to begin loop diff --git a/src/diffusers/pipelines/ddpm/pipeline_ddpm.py b/src/diffusers/pipelines/ddpm/pipeline_ddpm.py index b4290daf852c..ef62243501dc 100644 --- a/src/diffusers/pipelines/ddpm/pipeline_ddpm.py +++ b/src/diffusers/pipelines/ddpm/pipeline_ddpm.py @@ -23,11 +23,14 @@ class DDPMPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. 
Check the superclass documentation for the generic methods the
-    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+    Pipeline for image generation.
+
+    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
+    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Parameters:
-        unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image.
+        unet ([`UNet2DModel`]):
+            A [`UNet2DModel`] to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of
            [`DDPMScheduler`], or [`DDIMScheduler`].
@@ -47,24 +50,42 @@ def __call__(
        return_dict: bool = True,
    ) -> Union[ImagePipelineOutput, Tuple]:
        r"""
+        The call function to the pipeline for generation.
+
        Args:
            batch_size (`int`, *optional*, defaults to 1):
                The number of images to generate.
            generator (`torch.Generator`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            num_inference_steps (`int`, *optional*, defaults to 1000):
+                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+                generation deterministic.
+            num_inference_steps (`int`, *optional*, defaults to 1000):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
+
+        Example:
+
+        ```py
+        >>> # !pip install diffusers
+        >>> from diffusers import DDPMPipeline
+
+        >>> # load model and scheduler
+        >>> pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")
+
+        >>> # run pipeline in inference (sample random noise and denoise)
+        >>> image = pipe().images[0]
+
+        >>> # save image
+        >>> image.save("ddpm_generated_image.png")
+        ```
+
        Returns:
-            [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is
-            True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images.
+            [`~pipelines.ImagePipelineOutput`] or `tuple`:
+                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
+                returned where the first element is a list with the generated images.
        """
        # Sample gaussian noise to begin loop
        if isinstance(self.unet.config.sample_size, int):
diff --git a/src/diffusers/pipelines/dit/pipeline_dit.py b/src/diffusers/pipelines/dit/pipeline_dit.py
index 07fd2835ccf0..5efd86d88aca 100644
--- a/src/diffusers/pipelines/dit/pipeline_dit.py
+++ b/src/diffusers/pipelines/dit/pipeline_dit.py
@@ -30,16 +30,18 @@ class DiTPipeline(DiffusionPipeline):
    r"""
-    This pipeline inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+    Pipeline for image generation based on a Transformer backbone instead of a UNet.
+
+    This model inherits from [`DiffusionPipeline`].
Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: transformer ([`Transformer2DModel`]): - Class conditioned Transformer in Diffusion model to denoise the encoded image latents. + A [`Transformer2DModel`] to denoise the encoded image latents. vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. scheduler ([`DDIMScheduler`]): - A scheduler to be used in combination with `dit` to denoise the encoded image latents. + A scheduler to be used in combination with `transformer` to denoise the encoded image latents. """ def __init__( @@ -63,13 +65,15 @@ def __init__( def get_label_ids(self, label: Union[str, List[str]]) -> List[int]: r""" - Map label strings, *e.g.* from ImageNet, to corresponding class ids. + Map label strings from ImageNet to corresponding class ids. Parameters: - label (`str` or `dict` of `str`): label strings to be mapped to class ids. + label (`str` or `dict` of `str`): + Label strings to be mapped to class ids. Returns: - `list` of `int`: Class ids to be processed by pipeline. + `list` of `int`: + Class ids to be processed by pipeline. """ if not isinstance(label, list): @@ -94,24 +98,53 @@ def __call__( return_dict: bool = True, ) -> Union[ImagePipelineOutput, Tuple]: r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: class_labels (List[int]): - List of imagenet class labels for the images to be generated. + List of ImageNet class labels for the images to be generated. guidance_scale (`float`, *optional*, defaults to 4.0): - Scale of the guidance signal. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. generator (`torch.Generator`, *optional*): - A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation - deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. num_inference_steps (`int`, *optional*, defaults to 250): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. 
+ + Examples: + + ```py + >>> from diffusers import DiTPipeline, DPMSolverMultistepScheduler + >>> import torch + + >>> pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16) + >>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + >>> pipe = pipe.to("cuda") + + >>> # pick words from Imagenet class labels + >>> pipe.labels # to print all available words + + >>> # pick words that exist in ImageNet + >>> words = ["white shark", "umbrella"] + + >>> class_ids = pipe.get_label_ids(words) + + >>> generator = torch.manual_seed(33) + >>> output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator) + + >>> image = output.images[0] # label 'white shark' + ``` + + Returns: + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images """ batch_size = len(class_labels) diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py index ab7c28b96cc8..cfe620d9c2e6 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py @@ -31,18 +31,20 @@ class LDMTextToImagePipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for text-to-image generation using latent diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) Model to encode and decode images to and from latent representations. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): - Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. + Text-encoder model based on [`~transformers.BERT`]. tokenizer (`transformers.BertTokenizer`): - Tokenizer of class - [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.BertTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -76,38 +78,54 @@ def __call__( **kwargs, ) -> Union[Tuple, ImagePipelineOutput]: r""" + The call function to the pipeline for generation. + Args: prompt (`str` or `List[str]`): The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. 
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 1.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt` at
-                the, usually at the expense of lower image quality.
-            generator (`torch.Generator`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
+                A higher guidance scale value encourages the model to generate images closely linked to the text
+                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
+            generator (`torch.Generator`, *optional*):
+                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+                generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
+                tensor is generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*):
-                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
+                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple.
+
+        Example:
+
+        ```py
+        >>> # !pip install diffusers transformers
+        >>> from diffusers import DiffusionPipeline
+
+        >>> # load model and scheduler
+        >>> ldm = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")
+
+        >>> # run pipeline in inference (sample random noise and denoise)
+        >>> prompt = "A painting of a squirrel eating a burger"
+        >>> images = ldm([prompt], num_inference_steps=50, eta=0.3, guidance_scale=6).images
+
+        >>> # save images
+        >>> for idx, image in enumerate(images):
+        ...     image.save(f"squirrel-{idx}.png")
+        ```
        Returns:
-            [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is
-            True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images.
+            [`~pipelines.ImagePipelineOutput`] or `tuple`:
+                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
+                returned where the first element is a list with the generated images.
        """
        # 0.
Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py index ae620d325307..09bef71497b5 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py @@ -31,15 +31,16 @@ def preprocess(image): class LDMSuperResolutionPipeline(DiffusionPipeline): r""" - A pipeline for image super-resolution using Latent + A pipeline for image super-resolution using latent diffusion - This class inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) VAE Model to encode and decode images to and from latent representations. - unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latens. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], @@ -74,30 +75,59 @@ def __call__( return_dict: bool = True, ) -> Union[Tuple, ImagePipelineOutput]: r""" + The call function to the pipeline for generation. + Args: image (`torch.Tensor` or `PIL.Image.Image`): - `Image`, or tensor representing an image batch, that will be used as the starting point for the - process. + `Image` or tensor representing an image batch to be used as the starting point for the process. batch_size (`int`, *optional*, defaults to 1): Number of images to generate. num_inference_steps (`int`, *optional*, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. - return_dict (`bool`, *optional*): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. 
+ The output format of the generated image. Choose between `PIL.Image` or `np.array`. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + >>> #!pip install git+https://github.com/huggingface/diffusers.git + >>> import requests + >>> from PIL import Image + >>> from io import BytesIO + >>> from diffusers import LDMSuperResolutionPipeline + >>> import torch + + >>> # load model and scheduler + >>> pipeline = LDMSuperResolutionPipeline.from_pretrained("CompVis/ldm-super-resolution-4x-openimages") + >>> pipeline = pipeline.to("cuda") + + >>> # let's download an image + >>> url = ( + ... "https://user-images.githubusercontent.com/38061659/199705896-b48e17b8-b231-47cd-a270-4ffa5a93fa3e.png" + ... ) + >>> response = requests.get(url) + >>> low_res_img = Image.open(BytesIO(response.content)).convert("RGB") + >>> low_res_img = low_res_img.resize((128, 128)) + + >>> # run pipeline in inference (sample random noise and denoise) + >>> upscaled_image = pipeline(low_res_img, num_inference_steps=100, eta=1).images[0] + >>> # save image + >>> upscaled_image.save("ldm_generated_image.png") + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images """ if isinstance(image, PIL.Image.Image): batch_size = 1 diff --git a/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py b/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py index f844834b527d..9962d112bd66 100644 --- a/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py +++ b/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py @@ -136,28 +136,35 @@ def prepare_mask_and_masked_image(image, mask): class PaintByExamplePipeline(DiffusionPipeline): r""" - Pipeline for image-guided image inpainting using Stable Diffusion. *This is an experimental feature*. + + + Pipeline for image-guided image inpainting using Stable Diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. image_encoder ([`PaintByExampleImageEncoder`]): - Encodes the example input image. The unet is conditioned on the example image instead of a text prompt. + Encodes the example input image. The UNet is conditioned on the example image instead of a text prompt. tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. 
scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ # TODO: feature_extractor is required to encode initial images (if they are in PIL format), # we should give a descriptive message if the pipeline doesn't have one. @@ -378,66 +385,101 @@ def __call__( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: example_image (`torch.FloatTensor` or `PIL.Image.Image` or `List[PIL.Image.Image]`): - The exemplar image to guide the image generation. + An example image to guide image generation. image (`torch.FloatTensor` or `PIL.Image.Image` or `List[PIL.Image.Image]`): - `Image`, or tensor representing an image batch which will be inpainted, *i.e.* parts of the image will - be masked out with `mask_image` and repainted according to `prompt`. + `Image` or tensor representing an image batch to be inpainted (parts of the image are masked out with + `mask_image` and repainted according to `prompt`). mask_image (`torch.FloatTensor` or `PIL.Image.Image` or `List[PIL.Image.Image]`): - `Image`, or tensor representing an image batch, to mask `image`. White pixels in the mask will be - repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted - to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L) - instead of 3, so the expected shape would be `(B, H, W, 1)`. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + `Image` or tensor representing an image batch to mask `image`. White pixels in the mask are repainted, + while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a single channel + (luminance) before use. If it's a tensor, it should contain one color channel (L) instead of 3, so the + expected shape would be `(B, H, W, 1)`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). 
Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. 
+ + Example: + + ```py + >>> # !pip install diffusers transformers + + >>> import PIL + >>> import requests + >>> import torch + >>> from io import BytesIO + >>> from diffusers import PaintByExamplePipeline + + + >>> def download_image(url): + ... response = requests.get(url) + ... return PIL.Image.open(BytesIO(response.content)).convert("RGB") + + + >>> img_url = ( + ... "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/image/example_1.png" + ... ) + >>> mask_url = ( + ... "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/mask/example_1.png" + ... ) + >>> example_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/reference/example_1.jpg" + + >>> init_image = download_image(img_url).resize((512, 512)) + >>> mask_image = download_image(mask_url).resize((512, 512)) + >>> example_image = download_image(example_url).resize((512, 512)) + + >>> pipe = PaintByExamplePipeline.from_pretrained( + ... "Fantasy-Studio/Paint-by-Example", + ... torch_dtype=torch.float16, + ... ) + >>> pipe = pipe.to("cuda") + + >>> image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0] + >>> image + ``` Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Define call parameters if isinstance(image, PIL.Image.Image): diff --git a/src/diffusers/pipelines/pndm/pipeline_pndm.py b/src/diffusers/pipelines/pndm/pipeline_pndm.py index 361444079311..ffe7f8b3b94d 100644 --- a/src/diffusers/pipelines/pndm/pipeline_pndm.py +++ b/src/diffusers/pipelines/pndm/pipeline_pndm.py @@ -25,13 +25,16 @@ class PNDMPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for unconditional image generation. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: - unet (`UNet2DModel`): U-Net architecture to denoise the encoded image latents. - scheduler ([`SchedulerMixin`]): - The `PNDMScheduler` to be used in combination with `unet` to denoise the encoded image. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image latents. + scheduler ([`PNDMScheduler`]): + A [`PNDMScheduler`] to be used in combination with `unet` to denoise the encoded image. """ unet: UNet2DModel @@ -55,22 +58,42 @@ def __call__( **kwargs, ) -> Union[ImagePipelineOutput, Tuple]: r""" + The call function to the pipeline for generation. 
+ Args: - batch_size (`int`, `optional`, defaults to 1): The number of images to generate. + batch_size (`int`, `optional`, defaults to 1): + The number of images to generate. num_inference_steps (`int`, `optional`, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. generator (`torch.Generator`, `optional`): A [torch - generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation - deterministic. - output_type (`str`, `optional`, defaults to `"pil"`): The output format of the generate image. Choose - between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. - return_dict (`bool`, `optional`, defaults to `True`): Whether or not to return a - [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + output_type (`str`, `optional`, defaults to `"pil"`): + The output format of the generated image. Choose between `PIL.Image` or `np.array`. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + >>> # !pip install diffusers + >>> from diffusers import PNDMPipeline + + >>> # load model and scheduler + >>> pndm = PNDMPipeline.from_pretrained("google/ddpm-cifar10-32") + + >>> # run pipeline in inference (sample random noise and denoise) + >>> image = pndm().images[0] + + >>> # save image + >>> image.save("pndm_generated_image.png") + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated images. """ # For more information on the sampling method you can take a look at Algorithm 2 of # the official paper: https://arxiv.org/pdf/2202.09778.pdf diff --git a/src/diffusers/pipelines/repaint/pipeline_repaint.py b/src/diffusers/pipelines/repaint/pipeline_repaint.py index 6527a023a74f..d9768ea85226 100644 --- a/src/diffusers/pipelines/repaint/pipeline_repaint.py +++ b/src/diffusers/pipelines/repaint/pipeline_repaint.py @@ -77,6 +77,19 @@ def _preprocess_mask(mask: Union[List, PIL.Image.Image, torch.Tensor]): class RePaintPipeline(DiffusionPipeline): + r""" + Pipeline for image inpainting using RePaint. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). + + Parameters: + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image latents. + scheduler ([`RePaintScheduler`]): + A [`RePaintScheduler`] to be used in combination with `unet` to denoise the encoded image. + """ + unet: UNet2DModel scheduler: RePaintScheduler @@ -98,35 +111,77 @@ def __call__( return_dict: bool = True, ) -> Union[ImagePipelineOutput, Tuple]: r""" + The call function to the pipeline for generation. + Args: image (`torch.FloatTensor` or `PIL.Image.Image`): The original image to inpaint on. 
mask_image (`torch.FloatTensor` or `PIL.Image.Image`): - The mask_image where 0.0 values define which part of the original image to inpaint (change). + The mask_image where `0.0` define which part of the original image to inpaint. num_inference_steps (`int`, *optional*, defaults to 1000): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. eta (`float`): - The weight of noise for added noise in a diffusion step. Its value is between 0.0 and 1.0 - 0.0 is DDIM - and 1.0 is DDPM scheduler respectively. + The weight of the added noise in a diffusion step. Its value is between 0.0 and 1.0; 0.0 corresponds to + DDIM and 1.0 is the DDPM scheduler. jump_length (`int`, *optional*, defaults to 10): The number of steps taken forward in time before going backward in time for a single jump ("j" in RePaint paper). Take a look at Figure 9 and 10 in https://arxiv.org/pdf/2201.09865.pdf. jump_n_sample (`int`, *optional*, defaults to 10): - The number of times we will make forward time jump for a given chosen time sample. Take a look at - Figure 9 and 10 in https://arxiv.org/pdf/2201.09865.pdf. + The number of times to make a forward time jump for a given chosen time sample. Take a look at Figure 9 + and 10 in https://arxiv.org/pdf/2201.09865.pdf. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. - output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + output_type (`str`, `optional`, defaults to `"pil"`): + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + >>> from io import BytesIO + >>> import torch + >>> import PIL + >>> import requests + >>> from diffusers import RePaintPipeline, RePaintScheduler + + + >>> def download_image(url): + ... response = requests.get(url) + ... return PIL.Image.open(BytesIO(response.content)).convert("RGB") + + + >>> img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png" + >>> mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png" + + >>> # Load the original image and the mask as PIL images + >>> original_image = download_image(img_url).resize((256, 256)) + >>> mask_image = download_image(mask_url).resize((256, 256)) + + >>> # Load the RePaint scheduler and pipeline based on a pretrained DDPM model + >>> scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256") + >>> pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler) + >>> pipe = pipe.to("cuda") + + >>> generator = torch.Generator(device="cuda").manual_seed(0) + >>> output = pipe( + ... image=original_image, + ... mask_image=mask_image, + ... num_inference_steps=250, + ... eta=0.0, + ... jump_length=10, + ... jump_n_sample=10, + ... generator=generator, + ... 
) + >>> inpainted_image = output.images[0] + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated images. """ original_image = image diff --git a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py index 3ff7b8ee460b..2c171b611581 100644 --- a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py +++ b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py @@ -24,11 +24,16 @@ class ScoreSdeVePipeline(DiffusionPipeline): r""" + Pipeline for unconditional image generation. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). + Parameters: - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) - unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image. scheduler ([`SchedulerMixin`]): - The [`ScoreSdeVeScheduler`] scheduler to be used in combination with `unet` to denoise the encoded image. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image. + scheduler ([`ScoreSdeVeScheduler`]): + A [`ScoreSdeVeScheduler`] scheduler to be used in combination with `unet` to denoise the encoded image. """ unet: UNet2DModel scheduler: ScoreSdeVeScheduler @@ -48,21 +53,29 @@ def __call__( **kwargs, ) -> Union[ImagePipelineOutput, Tuple]: r""" + The call function to the pipeline for generation. + Args: batch_size (`int`, *optional*, defaults to 1): The number of images to generate. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. - output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + generator (`torch.Generator`, `optional`): A [torch + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + output_type (`str`, `optional`, defaults to `"pil"`): + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated images. 
""" img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py index 316a83125468..61ce824457c9 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py @@ -130,28 +130,27 @@ class CycleDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor r""" Pipeline for text-guided image to image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -595,71 +594,134 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): The prompt or prompts to guide the image generation. 
image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch, that will be used as the starting point for the - process. Can also accpet image latents as `image`, if passing latents directly, it will not be encoded - again. + `Image` or tensor representing an image batch to be used as the starting point. Can also accept image + latents as `image`, if passing latents directly, it will not be encoded again. strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in - `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. + Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a + starting point and more noise is added the higher the `strength`. The number of denoising steps depends + on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising + process runs for the full number of iterations specified in `num_inference_steps`. A value of 1 + essentially ignores `image`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. This parameter will be modulated by `strength`. + expense of slower inference. This parameter is modulated by `strength`. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. source_guidance_scale (`float`, *optional*, defaults to 1): Guidance scale for the source prompt. This is useful to control the amount of influence the source - prompt for encoding. + prompt has for encoding. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. - eta (`float`, *optional*, defaults to 0.1): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + eta (`float`, *optional*, defaults to 0.0): + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. 
+ generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). 
+ + Example: + + ```py + import requests + import torch + from PIL import Image + from io import BytesIO + + from diffusers import CycleDiffusionPipeline, DDIMScheduler + + # load the pipeline + # make sure you're logged in with `huggingface-cli login` + model_id_or_path = "CompVis/stable-diffusion-v1-4" + scheduler = DDIMScheduler.from_pretrained(model_id_or_path, subfolder="scheduler") + pipe = CycleDiffusionPipeline.from_pretrained(model_id_or_path, scheduler=scheduler).to("cuda") + + # let's download an initial image + url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/An%20astronaut%20riding%20a%20horse.png" + response = requests.get(url) + init_image = Image.open(BytesIO(response.content)).convert("RGB") + init_image = init_image.resize((512, 512)) + init_image.save("horse.png") + + # let's specify a prompt + source_prompt = "An astronaut riding a horse" + prompt = "An astronaut riding an elephant" + + # call the pipeline + image = pipe( + prompt=prompt, + source_prompt=source_prompt, + image=init_image, + num_inference_steps=100, + eta=0.1, + strength=0.8, + guidance_scale=2, + source_guidance_scale=1, + ).images[0] + + image.save("horse_to_elephant.png") + + # let's try another example + # See more samples at the original repo: https://github.com/ChenWu98/cycle-diffusion + url = ( + "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/A%20black%20colored%20car.png" + ) + response = requests.get(url) + init_image = Image.open(BytesIO(response.content)).convert("RGB") + init_image = init_image.resize((512, 512)) + init_image.save("black.png") + + source_prompt = "A black colored car" + prompt = "A blue colored car" + + # call the pipeline + torch.manual_seed(0) + image = pipe( + prompt=prompt, + source_prompt=source_prompt, + image=init_image, + num_inference_steps=100, + eta=0.1, + strength=0.85, + guidance_scale=3, + source_guidance_scale=1, + ).images[0] + + image.save("black_to_blue.png") + ``` Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. Check inputs self.check_inputs(prompt, strength, callback_steps) diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py index 7541cbef88ab..2e4cd6bc68f4 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py @@ -166,28 +166,27 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion r""" Pipeline for text-to-image generation using Stable Diffusion and Attend and Excite. - This model inherits from [`DiffusionPipeline`]. 
Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -700,75 +699,66 @@ def __call__( attn_res: Optional[Tuple[int]] = (16, 16), ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. token_indices (`List[int]`): The token indices to alter with attend-and-excite. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. 
More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. 
Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). max_iter_to_alter (`int`, *optional*, defaults to `25`): - Number of denoising steps to apply attend-and-excite. The first denoising steps are - where the attend-and-excite is applied. I.e. if `max_iter_to_alter` is 25 and there are a total of `30` - denoising steps, the first 25 denoising steps will apply attend-and-excite and the last 5 will not - apply attend-and-excite. + Number of denoising steps to apply attend-and-excite. The `max_iter_to_alter` denoising steps are when + attend-and-excite is applied. For example, if `max_iter_to_alter` is `25` and there are a total of `30` + denoising steps, the first `25` denoising steps applies attend-and-excite and the last `5` will not. thresholds (`dict`, *optional*, defaults to `{0: 0.05, 10: 0.5, 20: 0.8}`): Dictionary defining the iterations and desired thresholds to apply iterative latent refinement in. scale_factor (`int`, *optional*, default to 20): - Scale factor that controls the step size of each Attend and Excite update. + Scale factor to control the step size of each attend-and-excite update. attn_res (`tuple`, *optional*, default computed from width and height): The 2D resolution of the semantic attention map. @@ -776,10 +766,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. 
:type attention_store: object + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py index 07fd7ca61ad0..485c11989bf0 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py @@ -239,10 +239,16 @@ def preprocess_mask(mask, batch_size: int = 1): class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for text-guided image inpainting using Stable Diffusion using DiffEdit. *This is an experimental feature*. + - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This is an experimental feature! + + + + Pipeline for text-guided image inpainting using Stable Diffusion and DiffEdit. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] @@ -253,24 +259,23 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. inverse_scheduler (`[DDIMInverseScheduler]`): - A scheduler to be used in combination with `unet` to fill in the unmasked part of the input latents + A scheduler to be used in combination with `unet` to fill in the unmasked part of the input latents. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. 
- Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor", "inverse_scheduler"] @@ -823,6 +828,7 @@ def get_epsilon(self, model_output: torch.Tensor, sample: torch.Tensor, timestep ) @torch.no_grad() + @replace_example_docstring(EXAMPLE_DOC_STRING) def generate_mask( self, image: Union[torch.FloatTensor, PIL.Image.Image] = None, @@ -844,48 +850,42 @@ def generate_mask( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function used to generate a latent mask given a mask prompt, a target prompt, and an image. + Generate a latent mask given a mask prompt, a target prompt, and an image. Args: image (`PIL.Image.Image`): - `Image`, or tensor representing an image batch which will be used for computing the mask. + `Image` or tensor representing an image batch to be used for computing the mask. target_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the semantic mask generation. If not defined, one has to pass - `prompt_embeds`. instead. + The prompt or prompts to guide semantic mask generation. If not defined, you need to pass + `prompt_embeds`. target_negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). target_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. target_negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. source_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the semantic mask generation using the method in [DiffEdit: - Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf). If - not defined, one has to pass `source_prompt_embeds` or `source_image` instead. + The prompt or prompts to guide semantic mask generation using DiffEdit. If not defined, you need to + pass `source_prompt_embeds` or `source_image` instead. 
source_negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the semantic mask generation away from using the method in [DiffEdit: - Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf). If - not defined, one has to pass `source_negative_prompt_embeds` or `source_image` instead. + The prompt or prompts to guide semantic mask generation away from using DiffEdit. If not defined, you + need to pass `source_negative_prompt_embeds` or `source_image` instead. source_prompt_embeds (`torch.FloatTensor`, *optional*): Pre-generated text embeddings to guide the semantic mask generation. Can be used to easily tweak text - inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from - `source_prompt` input argument. + inputs (prompt weighting). If not provided, text embeddings are generated from `source_prompt` input + argument. source_negative_prompt_embeds (`torch.FloatTensor`, *optional*): Pre-generated text embeddings to negatively guide the semantic mask generation. Can be used to easily - tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from + tweak text inputs (prompt weighting). If not provided, text embeddings are generated from `source_negative_prompt` input argument. num_maps_per_mask (`int`, *optional*, defaults to 10): - The number of noise maps sampled to generate the semantic mask using the method in [DiffEdit: - Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf). + The number of noise maps sampled to generate the semantic mask using DiffEdit. mask_encode_strength (`float`, *optional*, defaults to 0.5): - Conceptually, the strength of the noise maps sampled to generate the semantic mask using the method in - [DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance]( - https://arxiv.org/pdf/2210.11427.pdf). Must be between 0 and 1. + The strength of the noise maps sampled to generate the semantic mask using DiffEdit. Must be between 0 + and 1. mask_thresholding_ratio (`float`, *optional*, defaults to 3.0): The maximum multiple of the mean absolute difference used to clamp the semantic guidance map before mask binarization. @@ -893,30 +893,25 @@ def generate_mask( The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. 
Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: - `List[PIL.Image.Image]` or `np.array`: `List[PIL.Image.Image]` if `output_type` is `"pil"`, otherwise a - `np.array`. When returning a `List[PIL.Image.Image]`, the list will consist of a batch of single-channel - binary image with dimensions `(height // self.vae_scale_factor, width // self.vae_scale_factor)`, otherwise - the `np.array` will have shape `(batch_size, height // self.vae_scale_factor, width // - self.vae_scale_factor)`. + `List[PIL.Image.Image]` or `np.array`: + When returning a `List[PIL.Image.Image]`, the list consists of a batch of single-channel binary images + with dimensions `(height // self.vae_scale_factor, width // self.vae_scale_factor)`. If it's + `np.array`, the shape is `(batch_size, height // self.vae_scale_factor, width // + self.vae_scale_factor)`. """ # 1. Check inputs (Provide dummy argument for callback_steps) @@ -1069,78 +1064,72 @@ def invert( num_auto_corr_rolls: int = 5, ): r""" - Function used to generate inverted latents given a prompt and image. + Generate inverted latents given a prompt and image. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`PIL.Image.Image`): - `Image`, or tensor representing an image batch to produce the inverted latents, guided by `prompt`. + `Image` or tensor representing an image batch to produce the inverted latents guided by `prompt`. inpaint_strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how far into the noising process to run latent inversion. Must be between 0 and - 1. When `strength` is 1, the inversion process will be run for the full number of iterations specified - in `num_inference_steps`. `image` will be used as a reference for the inversion process, adding more - noise the larger the `strength`. If `strength` is 0, no inpainting will occur. + Indicates extent of the noising process to run latent inversion. Must be between 0 and 1. When + `strength` is 1, the inversion process is run for the full number of iterations specified in + `num_inference_steps`. `image` is used as a reference for the inversion process, adding more noise the + larger the `strength`. If `strength` is 0, no inpainting occurs. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). 
Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. decode_latents (`bool`, *optional*, defaults to `False`): Whether or not to decode the inverted latents into a generated image. Setting this argument to `True` - will decode all inverted latents for each timestep into a list of generated images. + decodes all inverted latents for each timestep into a list of generated images. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.DiffEditInversionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. 
If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). lambda_auto_corr (`float`, *optional*, defaults to 20.0): - Lambda parameter to control auto correction + Lambda parameter to control auto correction. lambda_kl (`float`, *optional*, defaults to 20.0): - Lambda parameter to control Kullback–Leibler divergence output + Lambda parameter to control Kullback–Leibler divergence output. num_reg_steps (`int`, *optional*, defaults to 0): - Number of regularization loss steps + Number of regularization loss steps. num_auto_corr_rolls (`int`, *optional*, defaults to 5): - Number of auto correction roll steps + Number of auto correction roll steps. Examples: Returns: [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] or - `tuple`: [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] - if `return_dict` is `True`, otherwise a `tuple`. When returning a tuple, the first element is the inverted - latents tensors ordered by increasing noise, and then second is the corresponding decoded images if - `decode_latents` is `True`, otherwise `None`. + `tuple`: + If `return_dict` is `True`, + [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is the inverted latents tensors + ordered by increasing noise, and the second is the corresponding decoded images if `decode_latents` is + `True`, otherwise `None`. """ # 1. Check inputs @@ -1306,81 +1295,73 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. mask_image (`PIL.Image.Image`): - `Image`, or tensor representing an image batch, to mask the generated image. White pixels in the mask - will be repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be - converted to a single channel (luminance) before use. If it's a tensor, it should contain one color - channel (L) instead of 3, so the expected shape would be `(B, 1, H, W)`. + `Image` or tensor representing an image batch to mask the generated image. White pixels in the mask are + repainted, while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a + single channel (luminance) before use. If it's a tensor, it should contain one color channel (L) + instead of 3, so the expected shape would be `(B, 1, H, W)`. image_latents (`PIL.Image.Image` or `torch.FloatTensor`): Partially noised image latents from the inversion process to be used as inputs for image generation. 
inpaint_strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to inpaint the masked area. Must be between 0 and 1. When `strength` - is 1, the denoising process will be run on the masked area for the full number of iterations specified - in `num_inference_steps`. `image_latents` will be used as a reference for the masked area, adding more - noise to that region the larger the `strength`. If `strength` is 0, no inpainting will occur. + Indicates extent to inpaint the masked area. Must be between 0 and 1. When `strength` is 1, the + denoising process is run on the masked area for the full number of iterations specified in + `num_inference_steps`. `image_latents` is used as a reference for the masked area, adding more noise to + that region the larger the `strength`. If `strength` is 0, no inpainting occurs. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. 
Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 1. 
Check inputs diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py index 9b6015a6cb24..6b12719c6f5b 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py @@ -70,10 +70,10 @@ def preprocess(image): class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for pixel-level image editing by following text instructions. Based on Stable Diffusion. + Pipeline for pixel-level image editing by following text instructions (based on Stable Diffusion). - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] @@ -84,23 +84,22 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. 
""" _optional_components = ["safety_checker", "feature_extractor"] @@ -174,64 +173,57 @@ def __call__( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): - `Image`, or tensor representing an image batch which will be repainted according to `prompt`. Can also - accpet image latents as `image`, if passing latents directly, it will not be encoded again. + `Image` or tensor representing an image batch to be repainted according to `prompt`. Can also accept + image latents as `image`, if passing latents directly, it will not be encoded again. num_inference_steps (`int`, *optional*, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. This pipeline requires a value of at least `1`. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. image_guidance_scale (`float`, *optional*, defaults to 1.5): - Image guidance scale is to push the generated image towards the inital image `image`. Image guidance - scale is enabled by setting `image_guidance_scale > 1`. Higher image guidance scale encourages to - generate images that are closely linked to the source image `image`, usually at the expense of lower - image quality. This pipeline requires a value of at least `1`. + Push the generated image towards the inital `image`. Image guidance scale is enabled by setting + `image_guidance_scale > 1`. Higher image guidance scale encourages generated images that are closely + linked to the source `image`, usually at the expense of lower image quality. This pipeline requires a + value of at least `1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` - is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. 
+ Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -264,10 +256,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. 
+ If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Check inputs self.check_inputs(prompt, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py index a7911f52710f..a65ce0829048 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py @@ -51,6 +51,8 @@ >>> prompt = "a photo of an astronaut riding a horse on mars" >>> output = pipe(prompt) >>> rgb_image, depth_image = output.rgb, output.depth + >>> rgb_image[0].save("astronaut_ldm3d_rgb.jpg") + >>> depth_image[0].save("astronaut_ldm3d_depth.png") ``` """ diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py index a0cd1444f394..69dcaaff4bad 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py @@ -53,34 +53,36 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for text-to-image generation using "MultiDiffusion: Fusing Diffusion Paths for Controlled Image - Generation". + Pipeline for text-to-image generation using MultiDiffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.). + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). - To generate panorama-like images, be sure to pass the `width` parameter accordingly when using the pipeline. Our - recommendation for the `width` value is 2048. This is the default value of the `width` parameter for this pipeline. + + + To generate panorama-like images make sure you pass the `width` parameter accordingly. We recommend a `width` value + of 2048 which is the default. + + Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. 
+            A [`~transformers.CLIPTokenizer`] to tokenize text.
+        unet ([`UNet2DConditionModel`]):
+            A [`UNet2DConditionModel`] to denoise the encoded image latents.
         scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the encoded image latents. The original work
-            on Multi Diffsion used the [`DDIMScheduler`].
+            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
+            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
         safety_checker ([`StableDiffusionSafetyChecker`]):
             Classification module that estimates whether generated images could be considered offensive or harmful.
-            Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details.
+            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
+            about a model's potential harms.
         feature_extractor ([`CLIPImageProcessor`]):
-            Model that extracts features from generated images to be used as inputs for the `safety_checker`.
+            A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`.
     """
 
     _optional_components = ["safety_checker", "feature_extractor"]
 
@@ -468,64 +470,57 @@ def __call__(
         circular_padding: bool = False,
     ):
         r"""
-        Function invoked when calling the pipeline for generation.
+        The call function to the pipeline for generation.
 
         Args:
             prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            height (`int`, *optional*, defaults to 512:
+                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
+            height (`int`, *optional*, defaults to 512):
                 The height in pixels of the generated image.
             width (`int`, *optional*, defaults to 2048):
-                The width in pixels of the generated image. The width is kept to a high number because the
-                pipeline is supposed to be used for generating panorama-like images.
+                The width in pixels of the generated image. The width is kept high because the pipeline is supposed
+                to generate panorama-like images.
             num_inference_steps (`int`, *optional*, defaults to 50):
                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                 expense of slower inference.
             guidance_scale (`float`, *optional*, defaults to 7.5):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
+                A higher guidance scale value encourages the model to generate images closely linked to the text
+                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
             view_batch_size (`int`, *optional*, defaults to 1):
                 The batch size to denoise splited views. For some GPUs with high performance, higher view batch size
                 can speedup the generation and increase the VRAM usage.
             negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
+ The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. 
If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in @@ -539,10 +534,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py index ae7cdc4cda42..ce4793743bc1 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py @@ -63,13 +63,10 @@ class StableDiffusionParadigmsPipeline( DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin ): r""" - Parallelized version of StableDiffusionPipeline, based on the paper https://arxiv.org/abs/2305.16317 This pipeline - parallelizes the denoising steps to generate a single image faster (more akin to model parallelism). + Pipeline for text-to-image generation using a parallelized version of Stable Diffusion. - Pipeline for text-to-image generation using Stable Diffusion. - - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). In addition the pipeline inherits the following loading methods: - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] @@ -81,23 +78,22 @@ class StableDiffusionParadigmsPipeline( Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). 
- unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -496,82 +492,74 @@ def __call__( debug: bool = False, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. parallel (`int`, *optional*, defaults to 10): The batch size to use when doing parallel sampling. More parallelism may lead to faster inference but - requires higher memory usage and also can require more total FLOPs. + requires higher memory usage and can also require more total FLOPs. tolerance (`float`, *optional*, defaults to 0.1): The error tolerance for determining when to slide the batch window forward for parallel sampling. Lower - tolerance usually leads to less/no degradation. Higher tolerance is faster but can risk degradation of - sample quality. The tolerance is specified as a ratio of the scheduler's noise magnitude. + tolerance usually leads to less or no degradation. Higher tolerance is faster but can risk degradation + of sample quality. The tolerance is specified as a ratio of the scheduler's noise magnitude. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. 
Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. 
callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). debug (`bool`, *optional*, defaults to `False`): - Whether or not to run in debug mode. In debug mode, torch.cumsum is evaluated using the CPU. + Whether or not to run in debug mode. In debug mode, `torch.cumsum` is evaluated using the CPU. Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. 
Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor From d421408997527a478c82615e639fc0cf9e015b88 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 7 Jul 2023 15:30:47 -0700 Subject: [PATCH 08/13] finish first pass of pipelines --- .../api/pipelines/latent_diffusion_uncond.mdx | 25 +-- .../source/en/api/pipelines/model_editing.mdx | 46 +--- .../api/pipelines/self_attention_guidance.mdx | 50 +---- .../pipelines/semantic_stable_diffusion.mdx | 64 +----- docs/source/en/api/pipelines/shap_e.mdx | 26 +-- .../api/pipelines/spectrogram_diffusion.mdx | 37 +--- .../source/en/api/pipelines/stable_unclip.mdx | 72 +------ .../en/api/pipelines/stochastic_karras_ve.mdx | 21 +- .../source/en/api/pipelines/text_to_video.mdx | 55 +---- .../en/api/pipelines/text_to_video_zero.mdx | 41 ++-- docs/source/en/api/pipelines/unclip.mdx | 24 +-- docs/source/en/api/pipelines/unidiffuser.mdx | 26 +-- .../en/api/pipelines/versatile_diffusion.mdx | 42 +--- docs/source/en/api/pipelines/vq_diffusion.mdx | 22 +- .../pipeline_latent_diffusion.py | 5 +- .../pipeline_latent_diffusion_uncond.py | 42 ++-- .../semantic_stable_diffusion/__init__.py | 8 +- .../pipeline_semantic_stable_diffusion.py | 197 +++++++++--------- .../pipelines/shap_e/pipeline_shap_e.py | 48 ++--- .../shap_e/pipeline_shap_e_img2img.py | 53 ++--- .../pipeline_spectrogram_diffusion.py | 57 +++++ ...pipeline_stable_diffusion_model_editing.py | 136 ++++++------ .../pipeline_stable_diffusion_sag.py | 93 ++++----- .../pipeline_stable_unclip.py | 114 +++++----- .../pipeline_stable_unclip_img2img.py | 109 +++++----- .../pipeline_stochastic_karras_ve.py | 34 +-- .../text_to_video_synthesis/__init__.py | 5 +- .../pipeline_text_to_video_synth.py | 89 ++++---- .../pipeline_text_to_video_synth_img2img.py | 102 +++++---- .../pipeline_text_to_video_zero.py | 146 +++++++------ .../pipelines/unclip/pipeline_unclip.py | 59 +++--- .../unclip/pipeline_unclip_image_variation.py | 53 +++-- .../unidiffuser/pipeline_unidiffuser.py | 126 +++++------ .../pipeline_versatile_diffusion.py | 167 +++++++-------- ...ipeline_versatile_diffusion_dual_guided.py | 66 +++--- ...ine_versatile_diffusion_image_variation.py | 66 +++--- ...eline_versatile_diffusion_text_to_image.py | 68 +++--- .../vq_diffusion/pipeline_vq_diffusion.py | 54 +++-- 38 files changed, 1055 insertions(+), 1393 deletions(-) diff --git a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx index c293ebb9400e..fe54056577cd 100644 --- a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx +++ b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx @@ -12,31 +12,18 @@ specific language governing permissions and limitations under the License. # Unconditional Latent Diffusion -## Overview +Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. -Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. 
- -The abstract of the paper is the following: +The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.* -The original codebase can be found [here](https://github.com/CompVis/latent-diffusion). - -## Tips: - -- -- -- - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_latent_diffusion_uncond.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py) | *Unconditional Image Generation* | - | - -## Examples: +The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). ## LDMPipeline [[autodoc]] LDMPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/model_editing.mdx b/docs/source/en/api/pipelines/model_editing.mdx index 7aae35ba2a91..e33bb43584a1 100644 --- a/docs/source/en/api/pipelines/model_editing.mdx +++ b/docs/source/en/api/pipelines/model_editing.mdx @@ -10,52 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Editing Implicit Assumptions in Text-to-Image Diffusion Models +# Text-to-Image Model Editing -## Overview +[Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://huggingface.co/papers/2303.08084) is by Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. This pipeline enables editing diffusion model weights, such that its assumptions of a given concept are changed. The resulting change is expected to take effect in all prompt generations related to the edited concept. -[Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://arxiv.org/abs/2303.08084) by Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. - -The abstract of the paper is the following: +The abstract from the paper is: *Text-to-image diffusion models often make implicit assumptions about the world when generating images. 
While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.* -Resources: - -* [Project Page](https://time-diffusion.github.io/). -* [Paper](https://arxiv.org/abs/2303.08084). -* [Original Code](https://github.com/bahjat-kawar/time-diffusion). -* [Demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion). - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionModelEditingPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py) | *Text-to-Image Model Editing* | [🤗 Space](https://huggingface.co/spaces/bahjat-kawar/time-diffusion)) | - -This pipeline enables editing the diffusion model weights, such that its assumptions on a given concept are changed. The resulting change is expected to take effect in all prompt generations pertaining to the edited concept. - -## Usage example - -```python -import torch -from diffusers import StableDiffusionModelEditingPipeline - -model_ckpt = "CompVis/stable-diffusion-v1-4" -pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt) - -pipe = pipe.to("cuda") - -source_prompt = "A pack of roses" -destination_prompt = "A pack of blue roses" -pipe.edit_model(source_prompt, destination_prompt) - -prompt = "A field of roses" -image = pipe(prompt).images[0] -image.save("field_of_roses.png") -``` +You can find additional information about model editing on the [project page](https://time-diffusion.github.io/), [paper](https://arxiv.org/abs/2303.08084), [original codebase](https://github.com/bahjat-kawar/time-diffusion), and try it out in a [demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion). 
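+
+A minimal sketch of how a model edit can be applied with [`StableDiffusionModelEditingPipeline.edit_model`] (assuming the `CompVis/stable-diffusion-v1-4` checkpoint and a CUDA device are available):
+
+```python
+from diffusers import StableDiffusionModelEditingPipeline
+
+# load the pipeline and move it to the GPU
+pipe = StableDiffusionModelEditingPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+pipe = pipe.to("cuda")
+
+# update the cross-attention projections so the under-specified "source" prompt
+# is mapped to the attribute specified in the "destination" prompt
+source_prompt = "A pack of roses"
+destination_prompt = "A pack of blue roses"
+pipe.edit_model(source_prompt, destination_prompt)
+
+# prompts related to the edited concept now reflect the new assumption
+image = pipe("A field of roses").images[0]
+image.save("field_of_roses.png")
+```
+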
## StableDiffusionModelEditingPipeline [[autodoc]] StableDiffusionModelEditingPipeline - __call__ - all + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/self_attention_guidance.mdx b/docs/source/en/api/pipelines/self_attention_guidance.mdx index 133f2b775d71..30718e69e9fc 100644 --- a/docs/source/en/api/pipelines/self_attention_guidance.mdx +++ b/docs/source/en/api/pipelines/self_attention_guidance.mdx @@ -10,56 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Self-Attention Guidance (SAG) +# Self-Attention Guidance -## Overview +[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al. -[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al. - -The abstract of the paper is the following: +The abstract from the paper is: *Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.* -Resources: - -* [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance). -* [Paper](https://arxiv.org/abs/2210.00939). -* [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance). -* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance). -* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). - - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) | - -## Usage example - -```python -import torch -from diffusers import StableDiffusionSAGPipeline -from accelerate.utils import set_seed - -pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16) -pipe = pipe.to("cuda") - -seed = 8978 -prompt = "." 
-guidance_scale = 7.5
-num_images_per_prompt = 1
-
-sag_scale = 1.0
-
-set_seed(seed)
-images = pipe(
-    prompt, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale, sag_scale=sag_scale
-).images
-images[0].save("example.png")
-```
+You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [paper](https://arxiv.org/abs/2210.00939), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
 
 ## StableDiffusionSAGPipeline
 [[autodoc]] StableDiffusionSAGPipeline
 	- __call__
 	- all
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx b/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx
index b4562cf0c389..3b0898b72b48 100644
--- a/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx
+++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx
@@ -12,68 +12,18 @@ specific language governing permissions and limitations under the License.
 
 # Semantic Guidance
 
-Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://arxiv.org/abs/2301.12247) and provides strong semantic control over the image generation.
-Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, and stay true to the original image composition.
+Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
+Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.
 
-The abstract of the paper is the following:
+The abstract from the paper is:
 
 *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception.
We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* - -*Overview*: - -| Pipeline | Tasks | Colab | Demo -|---|---|:---:|:---:| -| [pipeline_semantic_stable_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/semantic-image-editing/blob/main/examples/SemanticGuidance.ipynb) | [Coming Soon](https://huggingface.co/AIML-TUDA) - -## Tips - -- The Semantic Guidance pipeline can be used with any [Stable Diffusion](./stable_diffusion/text2img) checkpoint. - -### Run Semantic Guidance - -The interface of [`SemanticStableDiffusionPipeline`] provides several additional parameters to influence the image generation. -Exemplary usage may look like this: - -```python -import torch -from diffusers import SemanticStableDiffusionPipeline - -pipe = SemanticStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16) -pipe = pipe.to("cuda") - -out = pipe( - prompt="a photo of the face of a woman", - num_images_per_prompt=1, - guidance_scale=7, - editing_prompt=[ - "smiling, smile", # Concepts to apply - "glasses, wearing glasses", - "curls, wavy hair, curly hair", - "beard, full beard, mustache", - ], - reverse_editing_direction=[False, False, False, False], # Direction of guidance i.e. increase all concepts - edit_warmup_steps=[10, 10, 10, 10], # Warmup period for each concept - edit_guidance_scale=[4, 5, 5, 5.4], # Guidance scale for each concept - edit_threshold=[ - 0.99, - 0.975, - 0.925, - 0.96, - ], # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions - edit_momentum_scale=0.3, # Momentum scale that will be added to the latent guidance - edit_mom_beta=0.6, # Momentum beta - edit_weights=[1, 1, 1, 1, 1], # Weights of the individual concepts against each other -) -``` - -For more examples check the Colab notebook. - -## StableDiffusionSafePipelineOutput -[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput - - all - ## SemanticStableDiffusionPipeline [[autodoc]] SemanticStableDiffusionPipeline - all - __call__ + +## StableDiffusionSafePipelineOutput +[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput + - all \ No newline at end of file diff --git a/docs/source/en/api/pipelines/shap_e.mdx b/docs/source/en/api/pipelines/shap_e.mdx index 2eec12e6a679..bb971ac869e6 100644 --- a/docs/source/en/api/pipelines/shap_e.mdx +++ b/docs/source/en/api/pipelines/shap_e.mdx @@ -9,28 +9,13 @@ specific language governing permissions and limitations under the License. # Shap-E -## Overview +The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai). - -The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai). - -The abstract of the paper is the following: +The abstract from the paper is: *We present Shap-E, a conditional generative model for 3D assets. 
Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.* -The original codebase can be found [here](https://github.com/openai/shap-e). - -## Available Pipelines: - -| Pipeline | Tasks | -|---|---| -| [pipeline_shap_e.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e.py) | *Text-to-Image Generation* | -| [pipeline_shap_e_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py) | *Image-to-Image Generation* | - -## Available checkpoints - -* [`openai/shap-e`](https://huggingface.co/openai/shap-e) -* [`openai/shap-e-img2img`](https://huggingface.co/openai/shap-e-img2img) +The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e). ## Usage Examples @@ -193,4 +178,7 @@ https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/s ## ShapEImg2ImgPipeline [[autodoc]] ShapEImg2ImgPipeline - all - - __call__ \ No newline at end of file + - __call__ + +## ShapEPipelineOutput +[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/spectrogram_diffusion.mdx b/docs/source/en/api/pipelines/spectrogram_diffusion.mdx index 728c6b3aa2f2..bf662585e099 100644 --- a/docs/source/en/api/pipelines/spectrogram_diffusion.mdx +++ b/docs/source/en/api/pipelines/spectrogram_diffusion.mdx @@ -10,45 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Multi-instrument Music Synthesis with Spectrogram Diffusion +# Spectrogram Diffusion -## Overview +[Spectrogram Diffusion](https://huggingface.co/papers/2206.05408) is by Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, and Jesse Engel. -[Spectrogram Diffusion](https://arxiv.org/abs/2206.05408) by Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, and Jesse Engel. +*An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. 
This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.* -An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes. - -The original codebase of this implementation can be found at [magenta/music-spectrogram-diffusion](https://github.com/magenta/music-spectrogram-diffusion). - -## Model +The original codebase can be found at [magenta/music-spectrogram-diffusion](https://github.com/magenta/music-spectrogram-diffusion). ![img](https://storage.googleapis.com/music-synthesis-with-spectrogram-diffusion/architecture.png) As depicted above the model takes as input a MIDI file and tokenizes it into a sequence of 5 second intervals. Each tokenized interval then together with positional encodings is passed through the Note Encoder and its representation is concatenated with the previous window's generated spectrogram representation obtained via the Context Encoder. For the initial 5 second window this is set to zero. The resulting context is then used as conditioning to sample the denoised Spectrogram from the MIDI window and we concatenate this spectrogram to the final output as well as use it for the context of the next MIDI window. The process repeats till we have gone over all the MIDI inputs. Finally a MelGAN decoder converts the potentially long spectrogram to audio which is the final result of this pipeline. 
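+
+A minimal sketch of this process with [`SpectrogramDiffusionPipeline`] (assuming the `google/music-spectrogram-diffusion` checkpoint, a CUDA device, and a local MIDI file such as `beethoven_hammerklavier_2.mid`):
+
+```python
+from diffusers import MidiProcessor, SpectrogramDiffusionPipeline
+
+# load the pipeline and the MIDI processor
+pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion")
+pipe = pipe.to("cuda")
+processor = MidiProcessor()
+
+# the processor tokenizes the MIDI file into 5 second note sequences,
+# which condition the spectrogram denoiser window by window
+output = pipe(processor("beethoven_hammerklavier_2.mid"))
+audio = output.audios[0]
+```
+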
-## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_spectrogram_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py) | *Unconditional Audio Generation* | - | - - -## Example usage - -```python -from diffusers import SpectrogramDiffusionPipeline, MidiProcessor - -pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion") -pipe = pipe.to("cuda") -processor = MidiProcessor() - -# Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid -output = pipe(processor("beethoven_hammerklavier_2.mid")) - -audio = output.audios[0] -``` - ## SpectrogramDiffusionPipeline [[autodoc]] SpectrogramDiffusionPipeline - all - __call__ + +## AudioPipelineOutput +[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_unclip.mdx b/docs/source/en/api/pipelines/stable_unclip.mdx index ee359d0ba486..739d357ddcdf 100644 --- a/docs/source/en/api/pipelines/stable_unclip.mdx +++ b/docs/source/en/api/pipelines/stable_unclip.mdx @@ -12,27 +12,19 @@ specific language governing permissions and limitations under the License. # Stable unCLIP -Stable unCLIP checkpoints are finetuned from [stable diffusion 2.1](./stable_diffusion_2) checkpoints to condition on CLIP image embeddings. -Stable unCLIP also still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used +Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings. +Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used for text guided image variation. When combined with an unCLIP prior, it can also be used for full text to image generation. -To know more about the unCLIP process, check out the following paper: +The abstract from the paper is: -[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. +*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* ## Tips -Stable unCLIP takes a `noise_level` as input during inference. `noise_level` determines how much noise is added +Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. 
A higher `noise_level` increases variation in the final un-noised images. By default, -we do not add any additional noise to the image embeddings i.e. `noise_level = 0`. - -### Available checkpoints: - -* Image variation - * [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) - * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small) -* Text-to-image - * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small) +we do not add any additional noise to the image embeddings (`noise_level = 0`). ### Text-to-Image Generation Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha) @@ -104,51 +96,7 @@ prompt = "A fantasy landscape, trending on artstation" images = pipe(init_image, prompt=prompt).images images[0].save("variation_image_two.png") ``` - -### Memory optimization - -If you are short on GPU memory, you can enable smart CPU offloading so that models that are not needed -immediately for a computation can be offloaded to CPU: - -```python -from diffusers import StableUnCLIPImg2ImgPipeline -from diffusers.utils import load_image -import torch - -pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variation="fp16" -) -# Offload to CPU. -pipe.enable_model_cpu_offload() - -url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" -init_image = load_image(url) - -images = pipe(init_image).images -images[0] -``` - -Further memory optimizations are possible by enabling VAE slicing on the pipeline: - -```python -from diffusers import StableUnCLIPImg2ImgPipeline -from diffusers.utils import load_image -import torch - -pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variation="fp16" -) -pipe.enable_model_cpu_offload() -pipe.enable_vae_slicing() - -url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" -init_image = load_image(url) - -images = pipe(init_image).images -images[0] -``` - -### StableUnCLIPPipeline +## StableUnCLIPPipeline [[autodoc]] StableUnCLIPPipeline - all @@ -161,7 +109,7 @@ images[0] - disable_xformers_memory_efficient_attention -### StableUnCLIPImg2ImgPipeline +## StableUnCLIPImg2ImgPipeline [[autodoc]] StableUnCLIPImg2ImgPipeline - all @@ -172,4 +120,6 @@ images[0] - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - \ No newline at end of file + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stochastic_karras_ve.mdx b/docs/source/en/api/pipelines/stochastic_karras_ve.mdx index 17a414303b9c..40eb5bb51733 100644 --- a/docs/source/en/api/pipelines/stochastic_karras_ve.mdx +++ b/docs/source/en/api/pipelines/stochastic_karras_ve.mdx @@ -12,25 +12,16 @@ specific language governing permissions and limitations under the License. # Stochastic Karras VE -## Overview +[Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) is by Tero Karras, Miika Aittala, Timo Aila and Samuli Laine. 
This pipeline implements the stochastic sampling tailored to variance expanding (VE) models. -[Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) by Tero Karras, Miika Aittala, Timo Aila and Samuli Laine. - -The abstract of the paper is the following: - -We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55. - -This pipeline implements the Stochastic sampling tailored to the Variance-Expanding (VE) models. - - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_stochastic_karras_ve.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py) | *Unconditional Image Generation* | - | +The abstract from the paper: +*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.* ## KarrasVePipeline [[autodoc]] KarrasVePipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/text_to_video.mdx b/docs/source/en/api/pipelines/text_to_video.mdx index 583d461ea948..319a9cca16ff 100644 --- a/docs/source/en/api/pipelines/text_to_video.mdx +++ b/docs/source/en/api/pipelines/text_to_video.mdx @@ -12,32 +12,19 @@ specific language governing permissions and limitations under the License. -This pipeline is for research purposes only. +🧪 This pipeline is for research purposes only. -# Text-to-video synthesis +# Text-to-Video -## Overview +[VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation](https://huggingface.co/papers/2303.08320) is by Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan. 
-[VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation](https://arxiv.org/abs/2303.08320) by Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan. - -The abstract of the paper is the following: +The abstract from the paper is: *A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to high-dimensional data spaces. Previous methods usually adopt a standard diffusion process, where frames in the same video clip are destroyed with independent noises, ignoring the content redundancy and temporal correlation. This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well-support text-conditioned video creation.* -Resources: - -* [Website](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) -* [GitHub repository](https://github.com/modelscope/modelscope/) -* [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis) - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis) -| [VideoToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py) | *Text-Guided Video-to-Video Generation* | [(TODO)🤗 Spaces]() +You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense). ## Usage example @@ -179,35 +166,6 @@ Here are some sample outputs: -### Memory optimizations - -Text-guided video generation with [`~TextToVideoSDPipeline`] and [`~VideoToVideoSDPipeline`] is very memory intensive both -when denoising with [`~UNet3DConditionModel`] and when decoding with [`~AutoencoderKL`]. It is possible though to reduce -memory usage at the cost of increased runtime to achieve the exact same result. To do so, it is recommended to enable -**forward chunking** and **vae slicing**: - -Forward chunking via [`~UNet3DConditionModel.enable_forward_chunking`]is explained in [this blog post](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers) and -allows to significantly reduce the required memory for the unet. 
You can chunk the feed forward layer over the `num_frames` -dimension by doing: - -```py -pipe.unet.enable_forward_chunking(chunk_size=1, dim=1) -``` - -Vae slicing via [`~TextToVideoSDPipeline.enable_vae_slicing`] and [`~VideoToVideoSDPipeline.enable_vae_slicing`] also -gives significant memory savings since the two pipelines decode all image frames at once. - -```py -pipe.enable_vae_slicing() -``` - -## Available checkpoints - -* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/) -* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy) -* [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w) -* [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) - ## TextToVideoSDPipeline [[autodoc]] TextToVideoSDPipeline - all @@ -217,3 +175,6 @@ pipe.enable_vae_slicing() [[autodoc]] VideoToVideoSDPipeline - all - __call__ + +## TextToVideoSDPipelineOutput +[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/text_to_video_zero.mdx b/docs/source/en/api/pipelines/text_to_video_zero.mdx index 3c3dcf5bb1ad..f5b4ace56c9f 100644 --- a/docs/source/en/api/pipelines/text_to_video_zero.mdx +++ b/docs/source/en/api/pipelines/text_to_video_zero.mdx @@ -10,49 +10,32 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Zero-Shot Text-to-Video Generation +# Text-to-Video Zero -## Overview - - -[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) by +[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). -Our method Text2Video-Zero enables zero-shot video generation using either -1. A textual prompt, or -2. A prompt combined with guidance from poses or edges, or -3. Video Instruct-Pix2Pix, i.e., instruction-guided video editing. +Text2Video-Zero enables zero-shot video generation using either: +1. A textual prompt +2. A prompt combined with guidance from poses or edges +3. Video Instruct-Pix2Pix (instruction-guided video editing) -Results are temporally consistent and follow closely the guidance and textual prompts. +Results are temporally consistent and closely follow the guidance and textual prompts. ![teaser-img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/t2v_zero_teaser.png) -The abstract of the paper is the following: +The abstract from the paper is: *Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. 
Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.* - - -Resources: - -* [Project Page](https://text2video-zero.github.io/) -* [Paper](https://arxiv.org/abs/2303.13439) -* [Original Code](https://github.com/Picsart-AI-Research/Text2Video-Zero) - - -## Available Pipelines: - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [TextToVideoZeroPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py) | *Zero-shot Text-to-Video Generation* | [🤗 Space](https://huggingface.co/spaces/PAIR/Text2Video-Zero) - +You can find additional information about Text-to-Video Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). ## Usage example @@ -268,8 +251,10 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth). - ## TextToVideoZeroPipeline [[autodoc]] TextToVideoZeroPipeline - all - - __call__ \ No newline at end of file + - __call__ + +## TextToVideoPipelineOutput +[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/unclip.mdx b/docs/source/en/api/pipelines/unclip.mdx index 13a578a0ab48..8d536d768447 100644 --- a/docs/source/en/api/pipelines/unclip.mdx +++ b/docs/source/en/api/pipelines/unclip.mdx @@ -7,31 +7,25 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# unCLIP +# UnCLIP -## Overview +[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)). -[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen +The abstract from the paper is following: -The abstract of the paper is the following: - -Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. 
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples. - -The unCLIP model in diffusers comes from kakaobrain's karlo and the original codebase can be found [here](https://github.com/kakaobrain/karlo). Additionally, lucidrains has a DALL-E 2 recreation [here](https://github.com/lucidrains/DALLE2-pytorch). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_unclip.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/unclip/pipeline_unclip.py) | *Text-to-Image Generation* | - | -| [pipeline_unclip_image_variation.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py) | *Image-Guided Image Generation* | - | +*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* +You can find lucidrains DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). ## UnCLIPPipeline [[autodoc]] UnCLIPPipeline - all - __call__ +## UnCLIPImageVariationPipeline [[autodoc]] UnCLIPImageVariationPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/unidiffuser.mdx b/docs/source/en/api/pipelines/unidiffuser.mdx index 10290e263e6d..ff8f4e7c6ec9 100644 --- a/docs/source/en/api/pipelines/unidiffuser.mdx +++ b/docs/source/en/api/pipelines/unidiffuser.mdx @@ -12,32 +12,19 @@ specific language governing permissions and limitations under the License. # UniDiffuser -The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://arxiv.org/abs/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu. 
+The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu. -The abstract of the [paper](https://arxiv.org/abs/2303.06555) is the following: +The abstract from the [paper](https://arxiv.org/abs/2303.06555) is: *This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).* -Resources: +You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml). -* [Paper](https://arxiv.org/abs/2303.06555). -* [Original Code](https://github.com/thu-ml/unidiffuser). - -Available Checkpoints are: -- *UniDiffuser-v0 (512x512 resolution)* [thu-ml/unidiffuser-v0](https://huggingface.co/thu-ml/unidiffuser-v0) -- *UniDiffuser-v1 (512x512 resolution)* [thu-ml/unidiffuser-v1](https://huggingface.co/thu-ml/unidiffuser-v1) - -This pipeline was contributed by our community member [dg845](https://github.com/dg845). - -## Available Pipelines: - -| Pipeline | Tasks | Demo | Colab | -|:---:|:---:|:---:|:---:| -| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*,
*Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | [🤗 Spaces](https://huggingface.co/spaces/thu-ml/unidiffuser) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/unidiffuser.ipynb) | +This pipeline was contributed by [dg845](https://github.com/dg845). ❤️ ## Usage Examples -Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks. +Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks: ### Unconditional Image and Text Generation @@ -202,3 +189,6 @@ print(final_prompt) [[autodoc]] UniDiffuserPipeline - all - __call__ + +## ImageTextPipelineOutput +[[autodoc]] pipelines.ImageTextPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/versatile_diffusion.mdx b/docs/source/en/api/pipelines/versatile_diffusion.mdx index f87fdc93e36e..51bf953e9de7 100644 --- a/docs/source/en/api/pipelines/versatile_diffusion.mdx +++ b/docs/source/en/api/pipelines/versatile_diffusion.mdx @@ -10,46 +10,24 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# VersatileDiffusion +# Versatile Diffusion -VersatileDiffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi . +Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi . -The abstract of the paper is the following: +The abstract from the paper is: *The recent advances in diffusion models have set an impressive milestone in many generation tasks. Trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest in academia and industry. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-flow network, dubbed Versatile Diffusion (VD), that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model. Moreover, we generalize VD to a unified multi-flow multimodal diffusion framework with grouped layers, swappable streams, and other propositions that can process modalities beyond images and text. Through our experiments, we demonstrate that VD and its underlying framework have the following merits: a) VD handles all subtasks with competitive quality; b) VD initiates novel extensions and applications such as disentanglement of style and semantic, image-text dual-guided generation, etc.; c) Through these experiments and applications, VD provides more semantic insights of the generated outputs.* ## Tips -- VersatileDiffusion is conceptually very similar as [Stable Diffusion](./stable_diffusion/overview), but instead of providing just a image data stream conditioned on text, VersatileDiffusion provides both a image and text data stream and can be conditioned on both text and image. 
+You can load the more memory intensive "all-in-one" [`VersatileDiffusionPipeline`] that supports all the tasks or use the individual pipelines which are more memory efficient. -### *Run VersatileDiffusion* - -You can both load the memory intensive "all-in-one" [`VersatileDiffusionPipeline`] that can run all tasks -with the same class as shown in [`VersatileDiffusionPipeline.text_to_image`], [`VersatileDiffusionPipeline.image_variation`], and [`VersatileDiffusionPipeline.dual_guided`] - -**or** - -You can run the individual pipelines which are much more memory efficient: - -- *Text-to-Image*: [`VersatileDiffusionTextToImagePipeline.__call__`] -- *Image Variation*: [`VersatileDiffusionImageVariationPipeline.__call__`] -- *Dual Text and Image Guided Generation*: [`VersatileDiffusionDualGuidedPipeline.__call__`] - -### *How to load and use different schedulers.* - -The versatile diffusion pipelines uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the alt diffusion pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc. -To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following: - -```python ->>> from diffusers import VersatileDiffusionPipeline, EulerDiscreteScheduler - ->>> pipeline = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion") ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) - ->>> # or ->>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("shi-labs/versatile-diffusion", subfolder="scheduler") ->>> pipeline = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", scheduler=euler_scheduler) -``` +| **Pipeline** | **Supported tasks** | +|--------------------------------------------|-----------------------------------| +| `VersatileDiffusion` | all of the below | +| `VersatileDiffusionTextToImagePipeline` | text-to-image | +| `VersatileDiffusionImageVariationPipeline` | image variation | +| `VersatileDiffusionDualGuidedPipeline` | image-text dual guided generation | ## VersatileDiffusionPipeline [[autodoc]] VersatileDiffusionPipeline diff --git a/docs/source/en/api/pipelines/vq_diffusion.mdx b/docs/source/en/api/pipelines/vq_diffusion.mdx index f8182c674f7a..66614c5b177b 100644 --- a/docs/source/en/api/pipelines/vq_diffusion.mdx +++ b/docs/source/en/api/pipelines/vq_diffusion.mdx @@ -10,26 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# VQDiffusion +# VQ Diffusion -## Overview +[Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) is by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo. -[Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo +The abstract from the paper is: -The abstract of the paper is the following: - -We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. 
This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality. - -The original codebase can be found [here](https://github.com/microsoft/VQ-Diffusion). - -## Available Pipelines: - -| Pipeline | Tasks | Colab -|---|---|:---:| -| [pipeline_vq_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py) | *Text-to-Image Generation* | - | +*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.* +The original codebase can be found at [microsoft/VQ-Diffusion](https://github.com/microsoft/VQ-Diffusion). 
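As a quick start, here is a minimal text-to-image sketch. It assumes the `microsoft/vq-diffusion-ithq` checkpoint and a CUDA device are available.

```py
from diffusers import VQDiffusionPipeline

pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")
pipe = pipe.to("cuda")

# Generate an image from a text prompt; the result is a list of PIL images.
image = pipe("teddy bear playing in the pool").images[0]
image.save("teddy_bear.png")
```

The pipeline returns an [`ImagePipelineOutput`], so the `images` attribute documented below applies here as well.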
## VQDiffusionPipeline [[autodoc]] VQDiffusionPipeline - all - __call__ + +## ImagePipelineOutput +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py index cfe620d9c2e6..46d7868b6936 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py @@ -93,7 +93,8 @@ def __call__( guidance_scale (`float`, *optional*, defaults to 1.0): A higher guidance scale value encourages the model to generate images closely linked to the text `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. - A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generator (`torch.Generator`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic. latents (`torch.FloatTensor`, *optional*): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image @@ -125,7 +126,7 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is - returned where the first element is a list with the generated images + returned where the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py b/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py index 73c607a27187..b58c466cfb60 100644 --- a/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py +++ b/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py @@ -25,15 +25,18 @@ class LDMPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for unconditional image generation using latent diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) Model to encode and decode images to and from latent representations. - unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image latents. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): - [`DDIMScheduler`] is to be used in combination with `unet` to denoise the encoded image latents. + [`DDIMScheduler`] is used in combination with `unet` to denoise the encoded image latents. """ def __init__(self, vqvae: VQModel, unet: UNet2DModel, scheduler: DDIMScheduler): @@ -52,24 +55,39 @@ def __call__( **kwargs, ) -> Union[Tuple, ImagePipelineOutput]: r""" + The call function to the pipeline for generation. + Args: batch_size (`int`, *optional*, defaults to 1): Number of images to generate. 
generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + >>> # !pip install diffusers transformers + >>> from diffusers import LDMPipeline + + >>> # load model and scheduler + >>> pipe = LDMPipeline.from_pretrained("CompVis/ldm-celebahq-256") + + >>> # run pipeline in inference (sample random noise and denoise) + >>> image = pipe().images[0] + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images """ latents = randn_tensor( diff --git a/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py b/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py index 0e312c5e3013..7c961cc53b83 100644 --- a/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py +++ b/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py @@ -16,11 +16,11 @@ class SemanticStableDiffusionPipelineOutput(BaseOutput): Args: images (`List[PIL.Image.Image]` or `np.ndarray`) - List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width, - num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline. + List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, + num_channels)`. nsfw_content_detected (`List[bool]`) - List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, or `None` if safety checking could not be performed. + List indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content or + None if safety checking could not be performed. 
""" images: Union[List[PIL.Image.Image], np.ndarray] diff --git a/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py b/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py index 911a5018de18..ec5eebad4cc0 100644 --- a/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py +++ b/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py @@ -16,78 +16,34 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name -EXAMPLE_DOC_STRING = """ - Examples: - ```py - >>> import torch - >>> from diffusers import SemanticStableDiffusionPipeline - - >>> pipe = SemanticStableDiffusionPipeline.from_pretrained( - ... "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16 - ... ) - >>> pipe = pipe.to("cuda") - - >>> out = pipe( - ... prompt="a photo of the face of a woman", - ... num_images_per_prompt=1, - ... guidance_scale=7, - ... editing_prompt=[ - ... "smiling, smile", # Concepts to apply - ... "glasses, wearing glasses", - ... "curls, wavy hair, curly hair", - ... "beard, full beard, mustache", - ... ], - ... reverse_editing_direction=[ - ... False, - ... False, - ... False, - ... False, - ... ], # Direction of guidance i.e. increase all concepts - ... edit_warmup_steps=[10, 10, 10, 10], # Warmup period for each concept - ... edit_guidance_scale=[4, 5, 5, 5.4], # Guidance scale for each concept - ... edit_threshold=[ - ... 0.99, - ... 0.975, - ... 0.925, - ... 0.96, - ... ], # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions - ... edit_momentum_scale=0.3, # Momentum scale that will be added to the latent guidance - ... edit_mom_beta=0.6, # Momentum beta - ... edit_weights=[1, 1, 1, 1, 1], # Weights of the individual concepts against each other - ... ) - >>> image = out.images[0] - ``` -""" class SemanticStableDiffusionPipeline(DiffusionPipeline): r""" - Pipeline for text-to-image generation with latent editing. + Pipeline for text-to-image generation using Stable Diffusion with latent editing. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) - - This model builds on the implementation of ['StableDiffusionPipeline'] + This model inherits from [`DiffusionPipeline`] and builds on the [`StableDiffusionPipeline`]. Check the superclass + documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular + device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). 
- unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): - A scheduler to be used in combination with `unet` to denoise the encoded image latens. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`Q16SafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -277,53 +233,49 @@ def __call__( sem_guidance: Optional[List[torch.Tensor]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). 
num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. editing_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to use for Semantic guidance. Semantic guidance is disabled by setting + The prompt or prompts to use for semantic guidance. Semantic guidance is disabled by setting `editing_prompt = None`. Guidance direction of prompt should be specified via `reverse_editing_direction`. editing_prompt_embeddings (`torch.Tensor>`, *optional*): @@ -332,42 +284,79 @@ def __call__( reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`): Whether the corresponding prompt in `editing_prompt` should be increased or decreased. edit_guidance_scale (`float` or `List[float]`, *optional*, defaults to 5): - Guidance scale for semantic guidance. If provided as list values should correspond to `editing_prompt`. - `edit_guidance_scale` is defined as `s_e` of equation 6 of [SEGA - Paper](https://arxiv.org/pdf/2301.12247.pdf). + Guidance scale for semantic guidance. 
If provided as a list, values should correspond to + `editing_prompt`. edit_warmup_steps (`float` or `List[float]`, *optional*, defaults to 10): - Number of diffusion steps (for each prompt) for which semantic guidance will not be applied. Momentum - will still be calculated for those steps and applied once all warmup periods are over. - `edit_warmup_steps` is defined as `delta` (δ) of [SEGA Paper](https://arxiv.org/pdf/2301.12247.pdf). + Number of diffusion steps (for each prompt) for which semantic guidance is not applied. Momentum is + calculated for those steps and applied once all warmup periods are over. edit_cooldown_steps (`float` or `List[float]`, *optional*, defaults to `None`): - Number of diffusion steps (for each prompt) after which semantic guidance will no longer be applied. + Number of diffusion steps (for each prompt) after which semantic guidance is longer applied. edit_threshold (`float` or `List[float]`, *optional*, defaults to 0.9): Threshold of semantic guidance. edit_momentum_scale (`float`, *optional*, defaults to 0.1): - Scale of the momentum to be added to the semantic guidance at each diffusion step. If set to 0.0 - momentum will be disabled. Momentum is already built up during warmup, i.e. for diffusion steps smaller - than `sld_warmup_steps`. Momentum will only be added to latent guidance once all warmup periods are - finished. `edit_momentum_scale` is defined as `s_m` of equation 7 of [SEGA - Paper](https://arxiv.org/pdf/2301.12247.pdf). + Scale of the momentum to be added to the semantic guidance at each diffusion step. If set to 0.0, + momentum is disabled. Momentum is already built up during warmup (for diffusion steps smaller than + `sld_warmup_steps`). Momentum is only added to latent guidance once all warmup periods are finished. edit_mom_beta (`float`, *optional*, defaults to 0.4): Defines how semantic guidance momentum builds up. `edit_mom_beta` indicates how much of the previous - momentum will be kept. Momentum is already built up during warmup, i.e. for diffusion steps smaller - than `edit_warmup_steps`. `edit_mom_beta` is defined as `beta_m` (β) of equation 8 of [SEGA - Paper](https://arxiv.org/pdf/2301.12247.pdf). + momentum is kept. Momentum is already built up during warmup (for diffusion steps smaller than + `edit_warmup_steps`). edit_weights (`List[float]`, *optional*, defaults to `None`): Indicates how much each individual concept should influence the overall guidance. If no weights are - provided all concepts are applied equally. `edit_mom_beta` is defined as `g_i` of equation 9 of [SEGA - Paper](https://arxiv.org/pdf/2301.12247.pdf). + provided all concepts are applied equally. sem_guidance (`List[torch.Tensor]`, *optional*): List of pre-generated guidance vectors to be applied at generation. Length of the list has to correspond to `num_inference_steps`. + Examples: + + ```py + >>> import torch + >>> from diffusers import SemanticStableDiffusionPipeline + + >>> pipe = SemanticStableDiffusionPipeline.from_pretrained( + ... "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16 + ... ) + >>> pipe = pipe.to("cuda") + + >>> out = pipe( + ... prompt="a photo of the face of a woman", + ... num_images_per_prompt=1, + ... guidance_scale=7, + ... editing_prompt=[ + ... "smiling, smile", # Concepts to apply + ... "glasses, wearing glasses", + ... "curls, wavy hair, curly hair", + ... "beard, full beard, mustache", + ... ], + ... reverse_editing_direction=[ + ... False, + ... False, + ... False, + ... False, + ... ], # Direction of guidance i.e. 
increase all concepts + ... edit_warmup_steps=[10, 10, 10, 10], # Warmup period for each concept + ... edit_guidance_scale=[4, 5, 5, 5.4], # Guidance scale for each concept + ... edit_threshold=[ + ... 0.99, + ... 0.975, + ... 0.925, + ... 0.96, + ... ], # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions + ... edit_momentum_scale=0.3, # Momentum scale that will be added to the latent guidance + ... edit_mom_beta=0.6, # Momentum beta + ... edit_weights=[1, 1, 1, 1, 1], # Weights of the individual concepts against each other + ... ) + >>> image = out.images[0] + ``` + Returns: [`~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput`] if `return_dict` is True, - otherwise a `tuple. When returning a tuple, the first element is a list with the generated images, and the - second element is a list of `bool`s denoting whether the corresponding generated image likely represents - "not-safe-for-work" (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, + [`~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput`] is returned, otherwise a + `tuple` is returned where the first element is a list with the generated images and the second element + is a list of `bool`s indicating whether the corresponding generated image contains "not-safe-for-work" + (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/shap_e/pipeline_shap_e.py b/src/diffusers/pipelines/shap_e/pipeline_shap_e.py index d93047ec665c..0e6942a7ebb0 100644 --- a/src/diffusers/pipelines/shap_e/pipeline_shap_e.py +++ b/src/diffusers/pipelines/shap_e/pipeline_shap_e.py @@ -68,11 +68,11 @@ @dataclass class ShapEPipelineOutput(BaseOutput): """ - Output class for ShapEPipeline. + Output class for [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`]. Args: images (`torch.FloatTensor`) - a list of images for 3D rendering + A list of images for 3D rendering. """ images: Union[List[List[PIL.Image.Image]], List[List[np.ndarray]]] @@ -80,10 +80,10 @@ class ShapEPipelineOutput(BaseOutput): class ShapEPipeline(DiffusionPipeline): """ - Pipeline for generating latent representation of a 3D asset and rendering with NeRF method with Shap-E + Pipeline for generating latent representation of a 3D asset and rendering with NeRF method with Shap-E. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: prior ([`PriorTransformer`]): @@ -91,13 +91,12 @@ class ShapEPipeline(DiffusionPipeline): text_encoder ([`CLIPTextModelWithProjection`]): Frozen text-encoder. tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + A [`~transformers.CLIPTokenizer`] to tokenize text. scheduler ([`HeunDiscreteScheduler`]): A scheduler to be used in combination with `prior` to generate image embedding. 
shap_e_renderer ([`ShapERenderer`]): Shap-E renderer projects the generated latents into parameters of a MLP that's used to create 3D objects - with the NeRF rendering method + with the NeRF rendering method. """ def __init__( @@ -132,10 +131,10 @@ def prepare_latents(self, shape, dtype, device, generator, latents, scheduler): def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -222,7 +221,7 @@ def __call__( return_dict: bool = True, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): @@ -233,30 +232,31 @@ def __call__( The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. usually at the expense of lower image quality. frame_size (`int`, *optional*, default to 64): - the width and height of each image frame of the generated 3d output + The width and height of each image frame of the generated 3D output. output_type (`str`, *optional*, defaults to `"pt"`): The output format of the generate image. 
Choose between: `"pil"` (`PIL.Image.Image`), `"np"` (`np.array`),`"latent"` (`torch.Tensor`), mesh ([`MeshDecoderOutput`]). return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain + tuple. Examples: Returns: - [`ShapEPipelineOutput`] or `tuple` + [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images. """ if isinstance(prompt, str): diff --git a/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py b/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py index 11446560295b..9c02b9ef2e59 100644 --- a/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py +++ b/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py @@ -67,11 +67,11 @@ @dataclass class ShapEPipelineOutput(BaseOutput): """ - Output class for ShapEPipeline. + Output class for [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`]. Args: images (`torch.FloatTensor`) - a list of images for 3D rendering + A list of images for 3D rendering. """ images: Union[PIL.Image.Image, np.ndarray] @@ -79,24 +79,24 @@ class ShapEPipelineOutput(BaseOutput): class ShapEImg2ImgPipeline(DiffusionPipeline): """ - Pipeline for generating latent representation of a 3D asset and rendering with NeRF method with Shap-E + Pipeline for generating latent representation of a 3D asset and rendering with NeRF method with Shap-E from an + image. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: prior ([`PriorTransformer`]): The canonincal unCLIP prior to approximate the image embedding from the text embedding. - text_encoder ([`CLIPTextModelWithProjection`]): - Frozen text-encoder. - tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + image_encoder ([`CLIPVisionModel`]): + Frozen image-encoder. + image_processor (`CLIPImageProcessor`): + A [`~transformers.CLIPImageProcessor`] to process images. scheduler ([`HeunDiscreteScheduler`]): A scheduler to be used in combination with `prior` to generate image embedding. shap_e_renderer ([`ShapERenderer`]): Shap-E renderer projects the generated latents into parameters of a MLP that's used to create 3D objects - with the NeRF rendering method + with the NeRF rendering method. """ def __init__( @@ -174,40 +174,41 @@ def __call__( return_dict: bool = True, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: - prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. + image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): + `Image` or tensor representing an image batch to be used as the starting point. 
Can also accept image + latents as `image`, if passing latents directly, it will not be encoded again. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. num_inference_steps (`int`, *optional*, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. frame_size (`int`, *optional*, default to 64): - the width and height of each image frame of the generated 3d output + The width and height of each image frame of the generated 3D output. output_type (`str`, *optional*, defaults to `"pt"`): (`np.array`),`"latent"` (`torch.Tensor`), mesh ([`MeshDecoderOutput`]). return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain + tuple. Examples: Returns: - [`ShapEPipelineOutput`] or `tuple` + [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images. """ if isinstance(image, PIL.Image.Image): diff --git a/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py b/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py index 66155ebf7f35..bb3922e77fd1 100644 --- a/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py +++ b/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py @@ -38,6 +38,21 @@ class SpectrogramDiffusionPipeline(DiffusionPipeline): + r""" + Pipeline for unconditional audio generation. + + This model inherits from [`DiffusionPipeline`]. 
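The `Examples:` sections in the two Shap-E call docstrings above are left empty in this patch. A minimal text-to-3D sketch, assuming the `openai/shap-e` checkpoint and the `export_to_gif` utility are available (both names are assumptions of this sketch, not part of the patch), could look like:

```py
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

# Illustrative checkpoint id; any Shap-E text-to-3D checkpoint should work the same way.
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# `images` is a list (one entry per prompt) of lists of rendered frames.
images = pipe(
    "a shark",
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=64,
).images

# Write the rendered frames of the first prompt out as a GIF.
gif_path = export_to_gif(images[0], "shark_3d.gif")
```

The image-to-3D variant (`ShapEImg2ImgPipeline`) follows the same pattern with an input image in place of the prompt.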
Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). + + Args: + notes_encoder ([`SpectrogramNotesEncoder`]): + continuous_encoder ([`SpectrogramContEncoder`]): + decoder ([`T5FilmDecoder`]): + A [`T5FilmDecoder`] to denoise the encoded audio latents. + scheduler ([`DDPMScheduler`]): + A scheduler to be used in combination with `decoder` to denoise the encoded audio latents. + melgan ([`OnnxRuntimeModel`]): + """ _optional_components = ["melgan"] def __init__( @@ -127,6 +142,48 @@ def __call__( f"`callback_steps` has to be a positive integer but is {callback_steps} of type" f" {type(callback_steps)}." ) + r""" + The call function to the pipeline for generation. + + Args: + input_tokens (`List[List[int]]`): + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + num_inference_steps (`int`, *optional*, defaults to 100): + The number of denoising steps. More denoising steps usually lead to a higher quality audio at the + expense of slower inference. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~pipelines.AudioPipelineOutput`] instead of a plain tuple. + output_type (`str`, *optional*, defaults to `"numpy"`): + The output format of the generated audio. + callback (`Callable`, *optional*): + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + callback_steps (`int`, *optional*, defaults to 1): + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. + + Example: + + ```py + >>> from diffusers import SpectrogramDiffusionPipeline, MidiProcessor + + >>> pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion") + >>> pipe = pipe.to("cuda") + >>> processor = MidiProcessor() + + >>> # Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid + >>> output = pipe(processor("beethoven_hammerklavier_2.mid")) + + >>> audio = output.audios[0] + ``` + + Returns: + [`pipelines.AudioPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`pipelines.AudioPipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated audio. 
+ """ pred_mel = np.zeros([1, TARGET_FEATURE_LENGTH, self.n_dims], dtype=np.float32) full_pred_mel = np.zeros([1, 0, self.n_dims], np.float32) diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py index 113d11c6afcb..46fa336d88db 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py @@ -34,56 +34,35 @@ AUGS_CONST = ["A photo of ", "An image of ", "A picture of "] -EXAMPLE_DOC_STRING = """ - Examples: - ```py - >>> import torch - >>> from diffusers import StableDiffusionModelEditingPipeline - - >>> model_ckpt = "CompVis/stable-diffusion-v1-4" - >>> pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt) - - >>> pipe = pipe.to("cuda") - - >>> source_prompt = "A pack of roses" - >>> destination_prompt = "A pack of blue roses" - >>> pipe.edit_model(source_prompt, destination_prompt) - - >>> prompt = "A field of roses" - >>> image = pipe(prompt).images[0] - ``` -""" - class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for text-to-image model editing using "Editing Implicit Assumptions in Text-to-Image Diffusion Models". + Pipeline for text-to-image model editing. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.). + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. - feature_extractor ([`CLIPFeatureExtractor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. 
+ feature_extractor ([`CLIPImageProcessor`]): + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. with_to_k ([`bool`]): - Whether to edit the key projection matrices along wiht the value projection matrices. + Whether to edit the key projection matrices along with the value projection matrices. with_augs ([`list`]): - Textual augmentations to apply while editing the text-to-image model. Set to [] for no augmentations. + Textual augmentations to apply while editing the text-to-image model. Set to `[]` for no augmentations. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -457,19 +436,19 @@ def edit_model( restart_params: bool = True, ): r""" - Apply model editing via closed-form solution (see Eq. 5 in the TIME paper https://arxiv.org/abs/2303.08084) + Apply model editing via closed-form solution (see Eq. 5 in the TIME [paper](https://arxiv.org/abs/2303.08084)). Args: source_prompt (`str`): The source prompt containing the concept to be edited. destination_prompt (`str`): - The destination prompt. Must contain all words from source_prompt with additional ones to specify the + The destination prompt. Must contain all words from `source_prompt` with additional ones to specify the target edit. lamb (`float`, *optional*, defaults to 0.1): The lambda parameter specifying the regularization intesity. Smaller values increase the editing power. restart_params (`bool`, *optional*, defaults to True): Restart the model parameters to their pre-trained version before editing. This is done to avoid edit - compounding. When it is False, edits accumulate. + compounding. When it is `False`, edits accumulate. """ # restart LDM parameters @@ -588,73 +567,82 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. 
negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. 
callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: + ```py + >>> import torch + >>> from diffusers import StableDiffusionModelEditingPipeline + + >>> model_ckpt = "CompVis/stable-diffusion-v1-4" + >>> pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt) + + >>> pipe = pipe.to("cuda") + + >>> source_prompt = "A pack of roses" + >>> destination_prompt = "A pack of blue roses" + >>> pipe.edit_model(source_prompt, destination_prompt) + + >>> prompt = "A field of roses" + >>> image = pipe(prompt).images[0] + ``` + Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py index 9d5bc8bdd8df..d26057fb7a1b 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py @@ -94,28 +94,27 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin) r""" Pipeline for text-to-image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. 
Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -453,77 +452,67 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. 
Guidance scale is enabled when `guidance_scale > 1`. sag_scale (`float`, *optional*, defaults to 0.75): - SAG scale as defined in [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance] - (https://arxiv.org/abs/2210.00939). `sag_scale` is defined as `s_s` of equation (24) of SAG paper: - https://arxiv.org/pdf/2210.00939.pdf. Typically chosen between [0, 1.0] for better quality. + Chosen between [0, 1.0] for better quality. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. 
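Because the `Examples:` section of the SAG call docstring is empty in this hunk, a minimal usage sketch may help; the checkpoint id is illustrative, and `sag_scale=0.75` simply mirrors the documented default:

```py
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Self-attention guidance is controlled by `sag_scale`; classifier-free guidance by `guidance_scale`.
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    sag_scale=0.75,
    guidance_scale=7.5,
).images[0]
image.save("astronaut_sag.png")
```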
return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py index 73dddafa9172..70763878fb0c 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip.py @@ -54,15 +54,14 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL """ Pipeline for text-to-image generation using stable unCLIP. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: prior_tokenizer ([`CLIPTokenizer`]): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + A [`CLIPTokenizer`]. prior_text_encoder ([`CLIPTextModelWithProjection`]): - Frozen text-encoder. 
+ Frozen [`CLIPTextModelWithProjection`] text-encoder. prior ([`PriorTransformer`]): The canonincal unCLIP prior to approximate the image embedding from the text embedding. prior_scheduler ([`KarrasDiffusionSchedulers`]): @@ -72,13 +71,13 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL embeddings after the noise has been applied. image_noising_scheduler ([`KarrasDiffusionSchedulers`]): Noise schedule for adding noise to the predicted image embeddings. The amount of noise to add is determined - by `noise_level` in `StableUnCLIPPipeline.__call__`. - tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + by the `noise_level`. + tokenizer ([`CLIPTokenizer`]): + A [`CLIPTokenizer`]. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + Frozen [`CLIPTextModel`] text-encoder. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`KarrasDiffusionSchedulers`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. vae ([`AutoencoderKL`]): @@ -160,10 +159,10 @@ def disable_vae_slicing(self): def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -573,8 +572,8 @@ def noise_image_embeddings( Add noise to the image embeddings. The amount of noise is controlled by a `noise_level` input. A higher `noise_level` increases the variance in the final un-noised images. - The noise is applied in two ways - 1. A noise schedule is applied directly to the embeddings + The noise is applied in two ways: + 1. A noise schedule is applied directly to the embeddings. 2. A vector of sinusoidal time embeddings are appended to the output. In both cases, the amount of noise is controlled by the same `noise_level`. @@ -637,87 +636,76 @@ def __call__( prior_latents: Optional[torch.FloatTensor] = None, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. 
+ height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 20): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 10.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. 
Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). noise_level (`int`, *optional*, defaults to `0`): The amount of noise to add to the image embeddings. A higher `noise_level` increases the variance in - the final un-noised images. See `StableUnCLIPPipeline.noise_image_embeddings` for details. + the final un-noised images. See [`StableUnCLIPPipeline.noise_image_embeddings`] for more details. prior_num_inference_steps (`int`, *optional*, defaults to 25): The number of denoising steps in the prior denoising process. More denoising steps usually lead to a higher quality image at the expense of slower inference. prior_guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale for the prior denoising process as defined in [Classifier-Free Diffusion - Guidance](https://arxiv.org/abs/2207.12598). `prior_guidance_scale` is defined as `w` of equation 2. of - [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting - `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to - the text `prompt`, usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. 
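The prior-related arguments above (`prior_num_inference_steps`, `prior_guidance_scale`, `noise_level`) only apply to the text-to-image Stable unCLIP pipeline that bundles a prior. A minimal sketch follows; the repo id is purely illustrative and may differ from the checkpoint you actually use:

```py
import torch
from diffusers import StableUnCLIPPipeline

# Illustrative repo id for a text-to-image Stable unCLIP checkpoint that includes a prior.
pipe = StableUnCLIPPipeline.from_pretrained(
    "fusing/stable-unclip-2-1-l", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# `noise_level` controls how much noise is added to the predicted image embeddings.
image = pipe("a photo of a red panda", noise_level=0, prior_guidance_scale=4.0).images[0]
image.save("red_panda.png")
```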
prior_latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image embedding generation in the prior denoising process. Can be used to tweak the same generation with - different prompts. If not provided, a latents tensor will ge generated by sampling using the supplied - random `generator`. + different prompts. If not provided, a latents tensor is generated by sampling using the supplied random + `generator`. Examples: Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning + a tuple, the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py index cdf4254301a1..ddfcc0db7a41 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_unclip_img2img.py @@ -65,10 +65,10 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): """ - Pipeline for text-guided image to image generation using stable unCLIP. + Pipeline for text-guided image-to-image generation using stable unCLIP. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: feature_extractor ([`CLIPImageProcessor`]): @@ -80,13 +80,13 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin embeddings after the noise has been applied. image_noising_scheduler ([`KarrasDiffusionSchedulers`]): Noise schedule for adding noise to the predicted image embeddings. The amount of noise to add is determined - by `noise_level` in `StableUnCLIPPipeline.__call__`. - tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + by the `noise_level`. + tokenizer (`~transformers.CLIPTokenizer`): + A [`~transformers.CLIPTokenizer`)]. + text_encoder ([`~transformers.CLIPTextModel`]): + Frozen [`~transformers.CLIPTextModel`] text-encoder. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`KarrasDiffusionSchedulers`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. 
vae ([`AutoencoderKL`]): @@ -162,10 +162,10 @@ def disable_vae_slicing(self): def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -540,8 +540,8 @@ def noise_image_embeddings( Add noise to the image embeddings. The amount of noise is controlled by a `noise_level` input. A higher `noise_level` increases the variance in the final un-noised images. - The noise is applied in two ways - 1. A noise schedule is applied directly to the embeddings + The noise is applied in two ways: + 1. A noise schedule is applied directly to the embeddings. 2. A vector of sinusoidal time embeddings are appended to the output. In both cases, the amount of noise is controlled by the same `noise_level`. @@ -601,82 +601,73 @@ def __call__( image_embeds: Optional[torch.FloatTensor] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): The prompt or prompts to guide the image generation. If not defined, either `prompt_embeds` will be used or prompt is initialized to `""`. image (`torch.FloatTensor` or `PIL.Image.Image`): - `Image`, or tensor representing an image batch. The image will be encoded to its CLIP embedding which - the unet will be conditioned on. Note that the image is _not_ encoded by the vae and then used as the - latents in the denoising process such as in the standard stable diffusion text guided image variation - process. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + `Image` or tensor representing an image batch. The image is encoded to its CLIP embedding which the + `unet` is conditioned on. The image is _not_ encoded by the `vae` and then used as the latents in the + denoising process like it is in the standard Stable Diffusion text-guided image variation process. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 20): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 10.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). 
- `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. 
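A minimal sketch of the image-variation flow described above, assuming the `stabilityai/stable-diffusion-2-1-unclip` checkpoint (an assumption of this sketch) and a placeholder input image path:

```py
import torch
from PIL import Image
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
# pipe.enable_model_cpu_offload()  # lower-memory alternative to moving everything to "cuda"

init_image = Image.open("input.png").convert("RGB")  # placeholder path; any RGB image works

# The image is mapped to a CLIP embedding that conditions the unet; the text prompt is optional.
image = pipe(image=init_image, prompt="A fantasy landscape, trending on artstation").images[0]
image.save("variation.png")
```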
return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). noise_level (`int`, *optional*, defaults to `0`): The amount of noise to add to the image embeddings. A higher `noise_level` increases the variance in - the final un-noised images. See `StableUnCLIPPipeline.noise_image_embeddings` for details. + the final un-noised images. See [`StableUnCLIPPipeline.noise_image_embeddings`] for more details. image_embeds (`torch.FloatTensor`, *optional*): - Pre-generated CLIP embeddings to condition the unet on. Note that these are not latents to be used in - the denoising process. If you want to provide pre-generated latents, pass them to `__call__` as - `latents`. + Pre-generated CLIP embeddings to condition the `unet` on. These latents are not used in the denoising + process. If you want to provide pre-generated latents, pass them to `__call__` as `latents`. Examples: Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning + a tuple, the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py index 2e0ab15eb975..48ec538730b4 100644 --- a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py +++ b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py @@ -24,17 +24,13 @@ class KarrasVePipeline(DiffusionPipeline): r""" - Stochastic sampling from Karras et al. [1] tailored to the Variance-Expanding (VE) models [2]. Use Algorithm 2 and - the VE column of Table 1 from [1] for reference. - - [1] Karras, Tero, et al. 
"Elucidating the Design Space of Diffusion-Based Generative Models." - https://arxiv.org/abs/2206.00364 [2] Song, Yang, et al. "Score-based generative modeling through stochastic - differential equations." https://arxiv.org/abs/2011.13456 + Pipeline for unconditional image generation. Parameters: - unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image. + unet ([`UNet2DModel`]): + A [`UNet2DModel`] to denoise the encoded image. scheduler ([`KarrasVeScheduler`]): - Scheduler for the diffusion process to be used in combination with `unet` to denoise the encoded image. + A scheduler to be used in combination with `unet` to denoise the encoded image. """ # add type hints for linting @@ -56,24 +52,32 @@ def __call__( **kwargs, ) -> Union[Tuple, ImagePipelineOutput]: r""" + The call function to the pipeline for generation. + Args: batch_size (`int`, *optional*, defaults to 1): The number of images to generate. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. + + Example: + + ```py + + ``` Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is - True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` + is returned where the first element is a list with the generated images. """ img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/text_to_video_synthesis/__init__.py b/src/diffusers/pipelines/text_to_video_synthesis/__init__.py index d70c1c2ea2a8..97683885aac9 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/__init__.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/__init__.py @@ -10,13 +10,12 @@ @dataclass class TextToVideoSDPipelineOutput(BaseOutput): """ - Output class for text to video pipelines. + Output class for text-to-video pipelines. Args: frames (`List[np.ndarray]` or `torch.FloatTensor`) List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as - a `torch` tensor. NumPy array present the denoised images of the diffusion pipeline. The length of the list - denotes the video length i.e., the number of frames. + a `torch` tensor. The length of the list denotes the video length (the number of frames). 
""" frames: Union[List[np.ndarray], torch.FloatTensor] diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py index b6600803747e..99edce7ef857 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py @@ -77,18 +77,18 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora r""" Pipeline for text-to-video generation. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Same as Stable Diffusion 2. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet3DConditionModel`]): Conditional U-Net architecture to denoise the encoded video latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet3DConditionModel`]): + A [`UNet3DConditionModel`] to denoise the encoded video latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -148,10 +148,10 @@ def disable_vae_tiling(self): def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -463,15 +463,14 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds`. - instead. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. 
If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated video. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated video. num_frames (`int`, *optional*, defaults to 16): The number of video frames that are generated. Defaults to 16 frames which at 8 frames per seconds @@ -480,55 +479,51 @@ def __call__( The number of denoising steps. More denoising steps usually lead to a higher quality videos at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate videos that are closely linked to the text `prompt`, - usually at the expense of lower video quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the video generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). + num_images_per_prompt (`int`, *optional*, defaults to 1): + The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. Latents should be of shape + tensor is generated by sampling using the supplied random `generator`. Latents should be of shape `(batch_size, num_channel, num_frames, height, width)`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. 
+ Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"np"`): - The output format of the generate video. Choose between `torch.FloatTensor` or `np.array`. + The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] instead + of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: - [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated frames. + [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is a list with the generated frames. """ # 0. 
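As a quick, non-normative illustration of the `__call__` API documented in the hunk above, a minimal text-to-video usage sketch might look as follows; the checkpoint id `damo-vilab/text-to-video-ms-1.7b` and the prompt are assumptions for illustration and are not part of this patch.

```py
# Minimal sketch (illustrative only): text-to-video generation with TextToVideoSDPipeline.
# The checkpoint id below is an assumption; any compatible text-to-video checkpoint works.
import torch
from diffusers import TextToVideoSDPipeline
from diffusers.utils import export_to_video

pipe = TextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
# Offload whole models to CPU between forward passes, as described by enable_model_cpu_offload above.
pipe.enable_model_cpu_offload()

# Generate 16 frames and export them to a video file.
video_frames = pipe(
    "an astronaut riding a horse on mars",
    num_frames=16,
    num_inference_steps=25,
).frames
video_path = export_to_video(video_frames)
```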
Default height and width to unet height = height or self.unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py index c8745d79c58f..c5ef06997d3c 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py @@ -137,20 +137,20 @@ def preprocess_video(video): class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): r""" - Pipeline for text-to-video generation. + Pipeline for text-guided video-to-video generation. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Same as Stable Diffusion 2. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet3DConditionModel`]): Conditional U-Net architecture to denoise the encoded video latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet3DConditionModel`]): + A [`UNet3DConditionModel`] to denoise the encoded video latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -210,10 +210,10 @@ def disable_vae_tiling(self): def enable_model_cpu_offload(self, gpu_id=0): r""" - Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared - to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` - method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with - `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. + Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a + time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs. + Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the + iterative execution of the `unet`. """ if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): from accelerate import cpu_offload_with_hook @@ -547,75 +547,67 @@ def __call__( cross_attention_kwargs: Optional[Dict[str, Any]] = None, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the video generation. 
If not defined, one has to pass `prompt_embeds`. - instead. - video: (`List[np.ndarray]` or `torch.FloatTensor`): - `video` frames or tensor representing a video batch, that will be used as the starting point for the - process. Can also accpet video latents as `image`, if passing latents directly, it will not be encoded - again. + The prompt or prompts to guide video generation. If not defined, you need to pass `prompt_embeds`. + video (`List[np.ndarray]` or `torch.FloatTensor`): + `video` frames or tensor representing a video batch to be used as the starting point for the process. + Can also accept video latents as `video`; if latents are passed directly, they are not encoded again. strength (`float`, *optional*, defaults to 0.8): - Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` - will be used as a starting point, adding more noise to it the larger the `strength`. The number of - denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will - be maximum and the denoising process will run for the full number of iterations specified in - `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. + Indicates how much to transform the reference `video`. Must be between 0 and 1. `video` is used as a + starting point, adding more noise to it the larger the `strength`. The number of denoising steps + depends on the amount of noise initially added. When `strength` is 1, added noise is maximum and the + denoising process runs for the full number of iterations specified in `num_inference_steps`. A value of + 1 essentially ignores `video`. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality videos at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate videos that are closely linked to the text `prompt`, - usually at the expense of lower video quality. + A higher guidance scale value encourages the model to generate videos closely linked to the text + `prompt` at the expense of lower video quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the video generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in video generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. 
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. Latents should be of shape + tensor is generated by sampling using the supplied random `generator`. Latents should be of shape `(batch_size, num_channel, num_frames, height, width)`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. output_type (`str`, *optional*, defaults to `"np"`): - The output format of the generate video. Choose between `torch.FloatTensor` or `np.array`. + The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] instead + of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. cross_attention_kwargs (`dict`, *optional*): - A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under - `self.processor` in - [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). 
+ A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). Examples: Returns: - [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated frames. + [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] is + returned, otherwise a `tuple` is returned where the first element is a list with the generated frames. """ # 0. Default height and width to unet num_images_per_prompt = 1 diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py index fe7207f904f0..271d20a2bc7d 100644 --- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py +++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py @@ -172,6 +172,17 @@ def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_ma @dataclass class TextToVideoPipelineOutput(BaseOutput): + r""" + Output class for zero-shot text-to-video pipeline. + + Args: + images (`[List[PIL.Image.Image]`, `np.ndarray`]): + List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, + num_channels)`. + nsfw_content_detected (`[List[bool]]`): + List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or + `None` if safety checking could not be performed. + """ images: Union[List[PIL.Image.Image], np.ndarray] nsfw_content_detected: Optional[List[bool]] @@ -264,28 +275,27 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): r""" Pipeline for zero-shot text-to-video generation using Stable Diffusion. - This model inherits from [`StableDiffusionPipeline`]. Check the superclass documentation for the generic methods - the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet3DConditionModel`] to denoise the encoded video latents. 
scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -311,16 +321,22 @@ def __init__( def forward_loop(self, x_t0, t0, t1, generator): """ - Perform ddpm forward process from time t0 to t1. This is the same as adding noise with corresponding variance. + Perform DDPM forward process from time t0 to t1. This is the same as adding noise with corresponding variance. Args: - x_t0: latent code at time t0 - t0: t0 - t1: t1 - generator: torch.Generator object + x_t0: + Latent code at time t0. + t0: + Timestep at t0. + t1: + Timestep at t1. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. Returns: - x_t1: forward process applied to x_t0 from time t0 to t1. + x_t1: + Forward process applied to x_t0 from time t0 to t1. """ eps = torch.randn(x_t0.size(), generator=generator, dtype=x_t0.dtype, device=x_t0.device) alpha_vec = torch.prod(self.scheduler.alphas[t0:t1]) @@ -340,30 +356,35 @@ def backward_loop( cross_attention_kwargs=None, ): """ - Perform backward process given list of time steps + Perform backward process given list of time steps. Args: - latents: Latents at time timesteps[0]. - timesteps: time steps, along which to perform backward process. - prompt_embeds: Pre-generated text embeddings + latents: + Latents at time timesteps[0]. + timesteps: + Time steps along which to perform backward process. + prompt_embeds: + Pre-generated text embeddings. guidance_scale: - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. 
If not specified, the callback will be - called at every step. - extra_step_kwargs: extra_step_kwargs. - cross_attention_kwargs: cross_attention_kwargs. - num_warmup_steps: number of warmup steps. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. + extra_step_kwargs: + Extra keyword arguments to pass to the scheduler `step` function. + cross_attention_kwargs: + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). + num_warmup_steps: + Number of warmup steps. Returns: - latents: latents of backward process output at time timesteps[-1] + latents: + Latents of backward process output at time timesteps[-1]. """ do_classifier_free_guidance = guidance_scale > 1.0 num_steps = (len(timesteps) - num_warmup_steps) // self.scheduler.order @@ -421,53 +442,50 @@ def __call__( frame_ids: Optional[List[int]] = None, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. - instead. - video_length (`int`, *optional*, defaults to 8): The number of generated video frames - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + video_length (`int`, *optional*, defaults to 8): + The number of generated video frames. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). + The prompt or prompts to guide what to not include in video generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_videos_per_prompt (`int`, *optional*, defaults to 1): The number of videos to generate per prompt. 
eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"numpy"`): - The output format of the generated image. Choose between `"latent"` and `"numpy"`. + The output format of the generated video. Choose between `"latent"` and `"numpy"`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a - plain tuple. + Whether or not to return a + [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput`] instead of + a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. motion_field_strength_x (`float`, *optional*, defaults to 12): Strength of motion in generated video along x-axis. See the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1. @@ -485,10 +503,10 @@ def __call__( chunk-by-chunk. Returns: - [`~pipelines.text_to_video_synthesis.TextToVideoPipelineOutput`]: - The output contains a ndarray of the generated images, when output_type != 'latent', otherwise a latent - codes of generated image, and a list of `bool`s denoting whether the corresponding generated image - likely represents "not-safe-for-work" (nsfw) content, according to the `safety_checker`. + [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput`]: + The output contains an `ndarray` of the generated video when `output_type` != `"latent"`, otherwise a + latent code of the generated video, and a list of `bool`s indicating whether the corresponding generated + video contains "not-safe-for-work" (nsfw) content. 
""" assert video_length > 0 if frame_ids is None: diff --git a/src/diffusers/pipelines/unclip/pipeline_unclip.py b/src/diffusers/pipelines/unclip/pipeline_unclip.py index f3b67bebfc8d..77fa2d1f8acd 100644 --- a/src/diffusers/pipelines/unclip/pipeline_unclip.py +++ b/src/diffusers/pipelines/unclip/pipeline_unclip.py @@ -33,33 +33,32 @@ class UnCLIPPipeline(DiffusionPipeline): """ - Pipeline for text-to-image generation using unCLIP + Pipeline for text-to-image generation using unCLIP. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: text_encoder ([`CLIPTextModelWithProjection`]): Frozen text-encoder. tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + A [`~transformers.CLIPTokenizer`] to tokenize text. prior ([`PriorTransformer`]): - The canonincal unCLIP prior to approximate the image embedding from the text embedding. + The canonical unCLIP prior to approximate the image embedding from the text embedding. text_proj ([`UnCLIPTextProjModel`]): Utility class to prepare and combine the embeddings before they are passed to the decoder. decoder ([`UNet2DConditionModel`]): The decoder to invert the image embedding into an image. super_res_first ([`UNet2DModel`]): - Super resolution unet. Used in all but the last step of the super resolution diffusion process. + Super resolution UNet. Used in all but the last step of the super resolution diffusion process. super_res_last ([`UNet2DModel`]): - Super resolution unet. Used in the last step of the super resolution diffusion process. + Super resolution UNet. Used in the last step of the super resolution diffusion process. prior_scheduler ([`UnCLIPScheduler`]): - Scheduler used in the prior denoising process. Just a modified DDPMScheduler. + Scheduler used in the prior denoising process (a modified [`DDPMScheduler`]). decoder_scheduler ([`UnCLIPScheduler`]): - Scheduler used in the decoder denoising process. Just a modified DDPMScheduler. + Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]). super_res_scheduler ([`UnCLIPScheduler`]): - Scheduler used in the super resolution denoising process. Just a modified DDPMScheduler. + Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]). """ @@ -227,12 +226,12 @@ def __call__( return_dict: bool = True, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. This can only be left undefined if - `text_model_output` and `text_attention_mask` is passed. + The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output` + and `text_attention_mask` is passed. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. prior_num_inference_steps (`int`, *optional*, defaults to 25): @@ -245,8 +244,8 @@ def __call__( The number of denoising steps for super resolution. 
More denoising steps usually lead to a higher quality image at the expense of slower inference. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. prior_latents (`torch.FloatTensor` of shape (batch size, embeddings dimension), *optional*): Pre-generated noisy latents to be used as inputs for the prior. decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*): Pre-generated noisy latents to be used as inputs for the decoder. super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*): Pre-generated noisy latents to be used as inputs for the decoder. prior_guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. decoder_guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. text_model_output (`CLIPTextModelOutput`, *optional*): - Pre-defined CLIPTextModel outputs that can be derived from the text encoder. Pre-defined text outputs - can be passed for tasks like text embedding interpolations. Make sure to also pass - `text_attention_mask` in this case. `prompt` can the be left to `None`. + Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text + outputs can be passed for tasks like text embedding interpolations. Make sure to also pass + `text_attention_mask` in this case. `prompt` can then be left `None`. text_attention_mask (`torch.Tensor`, *optional*): Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention masks are necessary when passing `text_model_output`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generated image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. 
+ + Returns: + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ if prompt is not None: if isinstance(prompt, str): diff --git a/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py b/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py index 580417f517be..820a94c51623 100644 --- a/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py +++ b/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py @@ -37,36 +37,32 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline): """ - Pipeline to generate variations from an input image using unCLIP + Pipeline for image-guided image generation using UnCLIP. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: text_encoder ([`CLIPTextModelWithProjection`]): Frozen text-encoder. tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + A [`~transformers.CLIPTokenizer`] to tokenize text. feature_extractor ([`CLIPImageProcessor`]): Model that extracts features from generated images to be used as inputs for the `image_encoder`. image_encoder ([`CLIPVisionModelWithProjection`]): - Frozen CLIP image-encoder. unCLIP Image Variation uses the vision portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection), - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). text_proj ([`UnCLIPTextProjModel`]): Utility class to prepare and combine the embeddings before they are passed to the decoder. decoder ([`UNet2DConditionModel`]): The decoder to invert the image embedding into an image. super_res_first ([`UNet2DModel`]): - Super resolution unet. Used in all but the last step of the super resolution diffusion process. + Super resolution UNet. Used in all but the last step of the super resolution diffusion process. super_res_last ([`UNet2DModel`]): - Super resolution unet. Used in the last step of the super resolution diffusion process. + Super resolution UNet. Used in the last step of the super resolution diffusion process. decoder_scheduler ([`UnCLIPScheduler`]): - Scheduler used in the decoder denoising process. Just a modified DDPMScheduler. + Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]). super_res_scheduler ([`UnCLIPScheduler`]): - Scheduler used in the super resolution denoising process. Just a modified DDPMScheduler. - + Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]). """ decoder: UNet2DConditionModel @@ -214,14 +210,14 @@ def __call__( return_dict: bool = True, ): """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. 
Args: image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`): - The image or images to guide the image generation. If you provide a tensor, it needs to comply with the - configuration of - [this](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json) - `CLIPImageProcessor`. Can be left to `None` only when `image_embeddings` are passed. + `Image` or tensor representing an image batch to be used as the starting point. If you provide a + tensor, it needs to be compatible with the [`CLIPImageProcessor`] + [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json). + Can be left as `None` only when `image_embeddings` are passed. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. decoder_num_inference_steps (`int`, *optional*, defaults to 25): @@ -231,26 +227,27 @@ def __call__( The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*): Pre-generated noisy latents to be used as inputs for the decoder. super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*): Pre-generated noisy latents to be used as inputs for the decoder. decoder_guidance_scale (`float`, *optional*, defaults to 4.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. image_embeddings (`torch.Tensor`, *optional*): Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings - can be passed for tasks like image interpolations. `image` can the be left to `None`. + can be passed for tasks like image interpolations. `image` can be left as `None`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generated image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. + + Returns: + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. 
""" if image is not None: if isinstance(image, PIL.Image.Image): diff --git a/src/diffusers/pipelines/unidiffuser/pipeline_unidiffuser.py b/src/diffusers/pipelines/unidiffuser/pipeline_unidiffuser.py index 3632d74d1c12..417187a5719f 100644 --- a/src/diffusers/pipelines/unidiffuser/pipeline_unidiffuser.py +++ b/src/diffusers/pipelines/unidiffuser/pipeline_unidiffuser.py @@ -81,43 +81,33 @@ class ImageTextPipelineOutput(BaseOutput): class UniDiffuserPipeline(DiffusionPipeline): r""" - Pipeline for a bimodal image-text [UniDiffuser](https://arxiv.org/pdf/2303.06555.pdf) model, which supports - unconditional text and image generation, text-conditioned image generation, image-conditioned text generation, and - joint image-text generation. + Pipeline for a bimodal image-text model which supports unconditional text and image generation, text-conditioned + image generation, image-conditioned text generation, and joint image-text generation. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. This - is part of the UniDiffuser image representation, along with the CLIP vision encoding. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. This + is part of the UniDiffuser image representation along with the CLIP vision encoding. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Similar to Stable Diffusion, UniDiffuser uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel) to encode text - prompts. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). image_encoder ([`CLIPVisionModel`]): - UniDiffuser uses the vision portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModel) to encode - images as part of its image representation, along with the VAE latent representation. + A [`~transformers.CLIPVisionModel`] to encode images as part of its image representation along with the VAE + latent representation. image_processor ([`CLIPImageProcessor`]): - CLIP image processor of class - [CLIPImageProcessor](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor), - used to preprocess the image before CLIP encoding it with `image_encoder`. + [`~transformers.CLIPImageProcessor`] to preprocess an image before CLIP encoding it with `image_encoder`. clip_tokenizer ([`CLIPTokenizer`]): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTokenizer) which - is used to tokenizer a prompt before encoding it with `text_encoder`. + A [`~transformers.CLIPTokenizer`] to tokenize the prompt before encoding it with `text_encoder`. text_decoder ([`UniDiffuserTextDecoder`]): Frozen text decoder. This is a GPT-style model which is used to generate text from the UniDiffuser embedding. 
text_tokenizer ([`GPT2Tokenizer`]): - Tokenizer of class - [GPT2Tokenizer](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer) which - is used along with the `text_decoder` to decode text for text generation. + A [`~transformers.GPT2Tokenizer`] to decode text for text generation; used along with the `text_decoder`. unet ([`UniDiffuserModel`]): - UniDiffuser uses a [U-ViT](https://github.com/baofff/U-ViT) model architecture, which is similar to a - [`Transformer2DModel`] with U-Net-style skip connections between transformer layers. + A [U-ViT](https://github.com/baofff/U-ViT) model with UNet-style skip connections between transformer + layers to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image and/or text latents. The original UniDiffuser paper uses the [`DPMSolverMultistepScheduler`] scheduler. @@ -1062,14 +1052,14 @@ def __call__( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` - instead. Required for text-conditioned image generation (`text2img`) mode. + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + Required for text-conditioned image generation (`text2img`) mode. image (`torch.FloatTensor` or `PIL.Image.Image`, *optional*): - `Image`, or tensor representing an image batch. Required for image-conditioned text generation + `Image` or tensor representing an image batch. Required for image-conditioned text generation (`img2text`) mode. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. @@ -1077,78 +1067,74 @@ def __call__( The width in pixels of the generated image. data_type (`int`, *optional*, defaults to 1): The data type (either 0 or 1). Only used if you are loading a checkpoint which supports a data type - embedding; this is added for compatibility with the UniDiffuser-v1 checkpoint. + embedding; this is added for compatibility with the + [UniDiffuser-v1](https://huggingface.co/thu-ml/unidiffuser-v1) checkpoint. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 8.0): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. Note that the original [UniDiffuser - paper](https://arxiv.org/pdf/2303.06555.pdf) uses a different definition of the guidance scale `w'`, - which satisfies `w = w' + 1`. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. 
If not defined, one has to pass - `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is - less than `1`). Used in text-conditioned image generation (`text2img`) mode. + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). Used in + text-conditioned image generation (`text2img`) mode. num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. Used in `text2img` (text-conditioned image generation) and `img` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are - supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples will be generated. + supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples are generated. num_prompts_per_image (`int`, *optional*, defaults to 1): The number of prompts to generate per image. Used in `img2text` (image-conditioned text generation) and `text` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are - supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples will be generated. + supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples are generated. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator` or `List[torch.Generator]`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for joint + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for joint image-text generation. Can be used to tweak the same generation with different prompts. If not - provided, a latents tensor will be generated by sampling using the supplied random `generator`. Note - that this is assumed to be a full set of VAE, CLIP, and text latents, if supplied, this will override - the value of `prompt_latents`, `vae_latents`, and `clip_latents`. + provided, a latents tensor is generated by sampling using the supplied random `generator`. This assumes + a full set of VAE, CLIP, and text latents and, if supplied, overrides the value of `prompt_latents`, + `vae_latents`, and `clip_latents`. prompt_latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for text + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for text generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will be generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`.
vae_latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will be generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. clip_latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will be generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not - provided, text embeddings will be generated from `prompt` input argument. Used in text-conditioned + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. Used in text-conditioned image generation (`text2img`) mode. negative_prompt_embeds (`torch.FloatTensor`, *optional*): - Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt - weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input - argument. Used in text-conditioned image generation (`text2img`) mode. + Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If + not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. Used + in text-conditioned image generation (`text2img`) mode. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.unidiffuser.ImageTextPipelineOutput`] instead of a plain tuple. + Whether or not to return a [`~pipelines.ImageTextPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. + Returns: [`~pipelines.unidiffuser.ImageTextPipelineOutput`] or `tuple`: - [`pipelines.unidiffuser.ImageTextPipelineOutput`] if `return_dict` is True, otherwise a `tuple`.
When - returning a tuple, the first element is a list with the generated images, and the second element is a list - of generated texts. + If `return_dict` is `True`, [`~pipelines.unidiffuser.ImageTextPipelineOutput`] is returned, otherwise a + `tuple` is returned where the first element is a list with the generated images and the second element + is a list of generated texts. """ # 0. Default height and width to unet diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py index 6d6b5e7863eb..5a730c3ed890 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py @@ -21,28 +21,27 @@ class VersatileDiffusionPipeline(DiffusionPipeline): r""" Pipeline for text-to-image generation using Stable Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. Stable Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + A [`~transformers.CLIPTokenizer`] to tokenize text. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionMegaSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. - Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. feature_extractor ([`CLIPImageProcessor`]): - Model that extracts features from generated images to be used as inputs for the `safety_checker`. + A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. """ tokenizer: CLIPTokenizer @@ -98,51 +97,47 @@ def image_variation( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. 
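As a rough usage sketch for the UniDiffuser call documented above (not taken from this patch): the checkpoint name comes from the docstring's UniDiffuser-v1 link, the assumption that the pipeline picks its generation mode from which of `prompt`/`image` are supplied follows the mode descriptions above, and the output field names (`images`, `text`) are assumptions based on the Returns section.

```python
import torch
from diffusers import UniDiffuserPipeline

# Checkpoint referenced in the data_type docstring above.
pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# text2img mode: only `prompt` is passed, so the pipeline conditions on text.
generator = torch.Generator("cuda").manual_seed(0)
sample = pipe(
    prompt="an astronaut riding a horse",
    num_inference_steps=20,
    guidance_scale=8.0,
    generator=generator,
)
sample.images[0].save("unidiffuser_text2img.png")

# joint mode: with neither `prompt` nor `image`, an image-text pair is sampled jointly.
joint = pipe(num_inference_steps=20, generator=generator)
print(joint.text[0])
```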
Args: image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`): The image prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. 
return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -171,10 +166,10 @@ def image_variation( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ expected_components = inspect.signature(VersatileDiffusionImageVariationPipeline.__init__).parameters.keys() components = {name: component for name, component in self.components.items() if name in expected_components} @@ -214,51 +209,47 @@ def text_to_image( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. 
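A minimal sketch of the `image_variation` call path documented above; the checkpoint id and input URL are placeholders rather than values taken from this patch.

```python
import torch
from diffusers import VersatileDiffusionPipeline
from diffusers.utils import load_image

# Placeholder checkpoint id for a Versatile Diffusion checkpoint.
pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# An image prompt replaces the text prompt, per the Args documented above.
image = load_image("https://example.com/input.jpg")  # placeholder URL
result = pipe.image_variation(image, num_inference_steps=50, guidance_scale=7.5)
result.images[0].save("variation.png")
```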
+ A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -278,10 +269,10 @@ def text_to_image( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. 
- When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ expected_components = inspect.signature(VersatileDiffusionTextToImagePipeline.__init__).parameters.keys() components = {name: component for name, component in self.components.items() if name in expected_components} @@ -327,51 +318,47 @@ def dual_guided( callback_steps: int = 1, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. 
Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -405,8 +392,8 @@ def dual_guided( Returns: [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple. When - returning a tuple, the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise + a `tuple` is returned where the first element is a list with the generated images. """ expected_components = inspect.signature(VersatileDiffusionDualGuidedPipeline.__init__).parameters.keys() diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py index 5986d66a61e7..495462a3cfb8 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py @@ -40,18 +40,20 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for image-text dual-guided generation using Versatile Diffusion. + + This model inherits from [`DiffusionPipeline`]. 
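The `inspect.signature(...)` / `self.components` filtering visible in the hunks above is how the combined pipeline hands its modules to a task-specific sub-pipeline. A sketch of the same pattern from user code, assuming a placeholder checkpoint id:

```python
import inspect

from diffusers import VersatileDiffusionPipeline, VersatileDiffusionTextToImagePipeline

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion")

# Keep only the components the target sub-pipeline's __init__ actually accepts,
# mirroring the `expected_components` snippet shown in the diff above.
expected = inspect.signature(VersatileDiffusionTextToImagePipeline.__init__).parameters.keys()
components = {name: module for name, module in pipe.components.items() if name in expected}
text2img = VersatileDiffusionTextToImagePipeline(**components)

image = text2img("a red panda reading a book").images[0]
```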
Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) Model to encode and decode images to and from latent representations. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): - Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. - tokenizer (`transformers.BertTokenizer`): - Tokenizer of class - [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + Text-encoder model based on [`~transformers.BERT`]. + tokenizer ([`transformers.BertTokenizer`]): + A [`transformers.BertTokenizer`]. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -395,51 +397,47 @@ def __call__( **kwargs, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. 
Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. - generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -474,8 +472,8 @@ def __call__( Returns: [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple. When - returning a tuple, the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise + a `tuple` is returned where the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py index 154548df7542..7b4d77026ab1 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py @@ -34,18 +34,20 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. 
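The `height = height or self.image_unet.config.sample_size * self.vae_scale_factor` default shown in the hunk above works out as follows; the concrete numbers are typical Stable Diffusion-style values assumed for illustration, not read from a specific checkpoint.

```python
# Latent-space resolution of the UNet and the VAE downsampling factor.
sample_size = 64      # e.g. image_unet.config.sample_size
vae_scale_factor = 8  # typically 2 ** (len(vae.config.block_out_channels) - 1)

# Default pixel resolution when `height`/`width` are not passed to __call__.
default_height = sample_size * vae_scale_factor
default_width = sample_size * vae_scale_factor
print(default_height, default_width)  # 512 512
```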
Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for image variation using Versatile Diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) Model to encode and decode images to and from latent representations. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): - Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. - tokenizer (`transformers.BertTokenizer`): - Tokenizer of class - [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + Text-encoder model based on [`~transformers.BERT`]. + tokenizer ([`transformers.BertTokenizer`]): + A [`transformers.BertTokenizer`]. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -247,51 +249,47 @@ def __call__( **kwargs, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`): The image prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. 
Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -320,10 +318,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. - When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. 
Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py index 1ba5d8451f2e..5cdde87fee68 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py @@ -33,18 +33,20 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline): r""" - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + Pipeline for text-to-image generation using Versatile Diffusion. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Parameters: vqvae ([`VQModel`]): - Vector-quantized (VQ) Model to encode and decode images to and from latent representations. + Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): - Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. - tokenizer (`transformers.BertTokenizer`): - Tokenizer of class - [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer). - unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. + Text-encoder model based on [`~transformers.BERT`]. + tokenizer ([`~transformers.BertTokenizer`]): + A [`~transformers.BertTokenizer`]. + unet ([`UNet2DConditionModel`]): + A [`UNet2DConditionModel`] to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -329,51 +331,47 @@ def __call__( **kwargs, ): r""" - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. - height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. 
Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. negative_prompt (`str` or `List[str]`, *optional*): - The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored - if `guidance_scale` is less than `1`). + The prompt or prompts to guide what to not include in image generation. If not defined, you need to + pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. eta (`float`, *optional*, defaults to 0.0): - Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to - [`schedulers.DDIMScheduler`], will be ignored for others. + Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies + to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor`, *optional*): - Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor will ge generated by sampling using the supplied random `generator`. + tensor is generated by sampling using the supplied random `generator`. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generate image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Examples: @@ -394,10 +392,10 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. 
- When returning a tuple, the first element is a list with the generated images, and the second element is a - list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" - (nsfw) content, according to the `safety_checker`. + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. """ # 0. Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py index 9147afe127e4..fc9cd57a085f 100644 --- a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py +++ b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py @@ -51,24 +51,21 @@ def __init__(self, learnable: bool, hidden_size: Optional[int] = None, length: O class VQDiffusionPipeline(DiffusionPipeline): r""" - Pipeline for text-to-image generation using VQ Diffusion + Pipeline for text-to-image generation using VQ Diffusion. - This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the - library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: vqvae ([`VQModel`]): - Vector Quantized Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent + Vector Quantized Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): - Frozen text-encoder. VQ Diffusion uses the text portion of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically - the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) variant. + Frozen text-encoder ([clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)). tokenizer (`CLIPTokenizer`): - Tokenizer of class - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). + A [`~transformers.CLIPTokenizer`] to tokenize text. transformer ([`Transformer2DModel`]): - Conditional transformer to denoise the encoded image latents. + A [`Transformer2DModel`] to denoise the encoded image latents. scheduler ([`VQDiffusionScheduler`]): A scheduler to be used in combination with `transformer` to denoise the encoded image latents. """ @@ -179,20 +176,17 @@ def __call__( callback_steps: int = 1, ) -> Union[ImagePipelineOutput, Tuple]: """ - Function invoked when calling the pipeline for generation. + The call function to the pipeline for generation. Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide the image generation. + The prompt or prompts to guide image generation. num_inference_steps (`int`, *optional*, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. 
guidance_scale (`float`, *optional*, defaults to 7.5): - Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). - `guidance_scale` is defined as `w` of equation 2. of [Imagen - Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > - 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`, - usually at the expense of lower image quality. + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. truncation_rate (`float`, *optional*, defaults to 1.0 (equivalent to no truncation)): Used to "truncate" the predicted classes for x_0 such that the cumulative probability for a pixel is at most `truncation_rate`. The lowest probabilities that would increase the cumulative probability above @@ -200,27 +194,27 @@ def __call__( num_images_per_prompt (`int`, *optional*, defaults to 1): The number of images to generate per prompt. generator (`torch.Generator`, *optional*): - One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) - to make generation deterministic. + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. latents (`torch.FloatTensor` of shape (batch), *optional*): - Pre-generated noisy latents to be used as inputs for image generation. Must be valid embedding indices. - Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will - be generated of completely masked latent pixels. + Pre-generated latents to be used as inputs for image generation. Must be valid embedding indices. Can be + used to tweak the same generation with different prompts. If not provided, a latents tensor of completely + masked latent pixels is generated. output_type (`str`, *optional*, defaults to `"pil"`): - The output format of the generated image. Choose between - [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. + The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): - A function that will be called every `callback_steps` steps during inference. The function will be - called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. callback_steps (`int`, *optional*, defaults to 1): - The frequency at which the `callback` function will be called. If not specified, the callback will be - called at every step. + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. Returns: - [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~ pipeline_utils.ImagePipelineOutput `] if `return_dict` - is True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images.
+ [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise + a `tuple` is returned where the first element is a list with the generated images. """ if isinstance(prompt, str): batch_size = 1 From ff51948452f4442258c29a376267e931ed28e105 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 7 Jul 2023 15:42:13 -0700 Subject: [PATCH 09/13] fix copies --- .../pipeline_stable_diffusion_xl.py | 17 +++++++---------- .../pipeline_stable_diffusion_xl_img2img.py | 17 +++++++---------- 2 files changed, 14 insertions(+), 20 deletions(-) diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py index afb4469f1bd4..9b5c18c78a4c 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py @@ -146,17 +146,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -164,17 +162,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. 
""" self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py index 20d1c341c552..302e9347beeb 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py @@ -153,17 +153,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -171,17 +169,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. 
""" self.vae.disable_tiling() From d9195366865309eade7d2f38fdba4006cc202f79 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 13 Jul 2023 17:16:51 -0700 Subject: [PATCH 10/13] second review --- docs/source/en/_toctree.yml | 14 +- .../source/en/api/pipelines/alt_diffusion.mdx | 6 + .../en/api/pipelines/attend_and_excite.mdx | 12 +- .../en/api/pipelines/audio_diffusion.mdx | 6 + docs/source/en/api/pipelines/audioldm.mdx | 14 +- .../en/api/pipelines/consistency_models.mdx | 8 +- .../en/api/pipelines/cycle_diffusion.mdx | 6 + .../en/api/pipelines/dance_diffusion.mdx | 6 + docs/source/en/api/pipelines/ddim.mdx | 6 + docs/source/en/api/pipelines/ddpm.mdx | 6 + docs/source/en/api/pipelines/dit.mdx | 6 + .../en/api/pipelines/latent_diffusion.mdx | 6 + .../api/pipelines/latent_diffusion_uncond.mdx | 6 + .../source/en/api/pipelines/model_editing.mdx | 8 +- docs/source/en/api/pipelines/overview.mdx | 4 +- .../en/api/pipelines/paint_by_example.mdx | 10 + docs/source/en/api/pipelines/panorama.mdx | 16 +- docs/source/en/api/pipelines/paradigms.mdx | 6 + docs/source/en/api/pipelines/pix2pix.mdx | 8 +- docs/source/en/api/pipelines/pix2pix_zero.mdx | 2 +- docs/source/en/api/pipelines/pndm.mdx | 6 + docs/source/en/api/pipelines/repaint.mdx | 7 + docs/source/en/api/pipelines/score_sde_ve.mdx | 6 + .../api/pipelines/self_attention_guidance.mdx | 8 +- .../pipelines/semantic_stable_diffusion.mdx | 6 + docs/source/en/api/pipelines/shap_e.mdx | 6 + .../api/pipelines/spectrogram_diffusion.mdx | 6 + .../pipelines/stable_diffusion/depth2img.mdx | 10 +- .../stable_diffusion/image_variation.mdx | 12 +- .../pipelines/stable_diffusion/img2img.mdx | 12 +- .../pipelines/stable_diffusion/inpaint.mdx | 22 +- .../stable_diffusion/latent_upscale.mdx | 12 +- .../stable_diffusion/ldm3d_diffusion.mdx | 8 +- .../pipelines/stable_diffusion/overview.mdx | 193 ++++++++++++++---- .../stable_diffusion/stable_diffusion_2.mdx | 117 ++++++++++- .../stable_diffusion_safe.mdx | 8 + .../pipelines/stable_diffusion/text2img.mdx | 12 +- .../pipelines/stable_diffusion/upscale.mdx | 10 +- .../en/api/pipelines/stochastic_karras_ve.mdx | 6 + docs/source/en/api/pipelines/unclip.mdx | 8 +- .../en/api/pipelines/versatile_diffusion.mdx | 18 +- docs/source/en/api/pipelines/vq_diffusion.mdx | 6 + src/diffusers/loaders.py | 36 ++-- .../alt_diffusion/pipeline_alt_diffusion.py | 23 ++- .../pipeline_alt_diffusion_img2img.py | 25 +-- .../pipelines/audio_diffusion/mel.py | 4 +- .../pipeline_audio_diffusion.py | 19 +- .../pipelines/audioldm/pipeline_audioldm.py | 10 +- .../pipeline_consistency_models.py | 2 +- .../pipeline_dance_diffusion.py | 3 +- src/diffusers/pipelines/ddim/pipeline_ddim.py | 3 +- src/diffusers/pipelines/ddpm/pipeline_ddpm.py | 5 +- src/diffusers/pipelines/dit/pipeline_dit.py | 2 +- .../pipeline_latent_diffusion.py | 7 +- ...peline_latent_diffusion_superresolution.py | 5 +- .../pipeline_latent_diffusion_uncond.py | 5 +- .../pipeline_paint_by_example.py | 17 +- .../pipelines/pipeline_flax_utils.py | 8 +- src/diffusers/pipelines/pipeline_utils.py | 18 +- src/diffusers/pipelines/pndm/pipeline_pndm.py | 7 +- .../pipelines/repaint/pipeline_repaint.py | 10 +- .../score_sde_ve/pipeline_score_sde_ve.py | 12 +- .../semantic_stable_diffusion/__init__.py | 2 +- .../pipeline_semantic_stable_diffusion.py | 18 +- .../pipelines/stable_diffusion/__init__.py | 8 +- .../pipeline_cycle_diffusion.py | 18 +- .../pipeline_flax_stable_diffusion.py | 18 +- .../pipeline_flax_stable_diffusion_img2img.py | 25 +-- .../pipeline_flax_stable_diffusion_inpaint.py | 23 
++- .../pipeline_stable_diffusion.py | 23 ++- ...line_stable_diffusion_attend_and_excite.py | 16 +- .../pipeline_stable_diffusion_depth2img.py | 18 +- ...peline_stable_diffusion_image_variation.py | 18 +- .../pipeline_stable_diffusion_img2img.py | 25 +-- .../pipeline_stable_diffusion_inpaint.py | 22 +- ...eline_stable_diffusion_instruct_pix2pix.py | 26 ++- ...ipeline_stable_diffusion_latent_upscale.py | 11 +- .../pipeline_stable_diffusion_ldm3d.py | 28 ++- ...pipeline_stable_diffusion_model_editing.py | 17 +- .../pipeline_stable_diffusion_panorama.py | 27 +-- .../pipeline_stable_diffusion_paradigms.py | 26 ++- .../pipeline_stable_diffusion_sag.py | 14 +- .../pipeline_stable_diffusion_upscale.py | 13 +- .../pipeline_stable_diffusion_safe.py | 16 +- .../pipeline_stochastic_karras_ve.py | 6 +- .../pipelines/unclip/pipeline_unclip.py | 6 +- .../unclip/pipeline_unclip_image_variation.py | 12 +- .../pipeline_versatile_diffusion.py | 14 +- ...ipeline_versatile_diffusion_dual_guided.py | 6 +- ...ine_versatile_diffusion_image_variation.py | 6 +- ...eline_versatile_diffusion_text_to_image.py | 4 +- .../vq_diffusion/pipeline_vq_diffusion.py | 13 +- 92 files changed, 880 insertions(+), 475 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 4ae67cfbe096..a55ebfc11a4a 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -236,15 +236,15 @@ - local: api/pipelines/stable_diffusion/overview title: Overview - local: api/pipelines/stable_diffusion/text2img - title: Text-to-Image + title: Text-to-image - local: api/pipelines/stable_diffusion/img2img - title: Image-to-Image + title: Image-to-image - local: api/pipelines/stable_diffusion/inpaint - title: Inpaint + title: Inpainting - local: api/pipelines/stable_diffusion/depth2img - title: Depth-to-Image + title: Depth-to-image - local: api/pipelines/stable_diffusion/image_variation - title: Image Variation + title: Image variation - local: api/pipelines/stable_diffusion/stable_diffusion_safe title: Safe Stable Diffusion - local: api/pipelines/stable_diffusion/stable_diffusion_2 @@ -252,9 +252,9 @@ - local: api/pipelines/stable_diffusion/stable_diffusion_xl title: Stable Diffusion XL - local: api/pipelines/stable_diffusion/latent_upscale - title: Latent Upscaler + title: Latent upscaler - local: api/pipelines/stable_diffusion/upscale - title: Super Resolution + title: Super-resolution - local: api/pipelines/stable_diffusion/ldm3d_diffusion title: LDM3D Text-to-(RGB, Depth) - local: api/pipelines/stable_diffusion/adapter diff --git a/docs/source/en/api/pipelines/alt_diffusion.mdx b/docs/source/en/api/pipelines/alt_diffusion.mdx index d5f7031d59f9..e3d248f31db4 100644 --- a/docs/source/en/api/pipelines/alt_diffusion.mdx +++ b/docs/source/en/api/pipelines/alt_diffusion.mdx @@ -18,6 +18,12 @@ The abstract from the paper is: *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. 
Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.* + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## AltDiffusionPipeline [[autodoc]] AltDiffusionPipeline diff --git a/docs/source/en/api/pipelines/attend_and_excite.mdx b/docs/source/en/api/pipelines/attend_and_excite.mdx index 3c9fde50ca35..ee205b8b283f 100644 --- a/docs/source/en/api/pipelines/attend_and_excite.mdx +++ b/docs/source/en/api/pipelines/attend_and_excite.mdx @@ -10,15 +10,21 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Attend and Excite +# Attend-and-Excite -Attend and Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation. +Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation. The abstract from the paper is: *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* -You can find additional information about Attend and Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), [paper](https://arxiv.org/abs/2301.13826), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). +You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
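A minimal sketch of how [`StableDiffusionAttendAndExcitePipeline`] is typically driven, assuming the `CompVis/stable-diffusion-v1-4` checkpoint and the token indices below (both illustrative; pick the indices that match the tokens you want to excite in your own prompt):

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

# Assumption: illustrative checkpoint and token indices.
pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Inspect the tokenized prompt to find the indices of "cat" and "frog".
print(pipe.get_indices(prompt))

image = pipe(
    prompt,
    token_indices=[2, 5],   # tokens whose attention should be strengthened
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,   # number of denoising steps that apply the attention update
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("cat_and_frog.png")
```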
+ + ## StableDiffusionAttendAndExcitePipeline diff --git a/docs/source/en/api/pipelines/audio_diffusion.mdx b/docs/source/en/api/pipelines/audio_diffusion.mdx index 20b97a80a733..cc52c70a8e9e 100644 --- a/docs/source/en/api/pipelines/audio_diffusion.mdx +++ b/docs/source/en/api/pipelines/audio_diffusion.mdx @@ -16,6 +16,12 @@ specific language governing permissions and limitations under the License. The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## AudioDiffusionPipeline [[autodoc]] AudioDiffusionPipeline - all diff --git a/docs/source/en/api/pipelines/audioldm.mdx b/docs/source/en/api/pipelines/audioldm.mdx index 36120b9ec585..2407a205c92b 100644 --- a/docs/source/en/api/pipelines/audioldm.mdx +++ b/docs/source/en/api/pipelines/audioldm.mdx @@ -12,13 +12,15 @@ specific language governing permissions and limitations under the License. # AudioLDM -AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. - -Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM +AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. +The abstract from the paper is: + +*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. 
Our implementation and demos are available at https://audioldm.github.io.* + The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM), and the pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). ## Tips @@ -33,6 +35,12 @@ During inference: * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## AudioLDMPipeline [[autodoc]] AudioLDMPipeline - all diff --git a/docs/source/en/api/pipelines/consistency_models.mdx b/docs/source/en/api/pipelines/consistency_models.mdx index fa4d36102d03..cbe2691e67d5 100644 --- a/docs/source/en/api/pipelines/consistency_models.mdx +++ b/docs/source/en/api/pipelines/consistency_models.mdx @@ -2,7 +2,7 @@ Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. -The abstract from the [paper](https://arxiv.org/pdf/2303.01469.pdf) is: +The abstract from the paper is: *Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. * @@ -34,6 +34,12 @@ For an additional speed-up, use `torch.compile` to generate multiple images in < image.show() ``` + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
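To complement the `torch.compile` tip above, here is a minimal [`ConsistencyModelPipeline`] sampling sketch, assuming the distilled class-conditional ImageNet-64 checkpoint (an illustrative choice):

```python
import torch
from diffusers import ConsistencyModelPipeline

# Assumption: illustrative distilled ImageNet-64 checkpoint.
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
).to("cuda")

# One-step generation, the default mode for consistency models.
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("onestep.png")

# Multistep sampling trades extra compute for quality by passing explicit timesteps.
image = pipe(num_inference_steps=None, timesteps=[22, 0], class_labels=145).images[0]
image.save("multistep.png")
```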
+ + + ## ConsistencyModelPipeline [[autodoc]] ConsistencyModelPipeline - all diff --git a/docs/source/en/api/pipelines/cycle_diffusion.mdx b/docs/source/en/api/pipelines/cycle_diffusion.mdx index e0f74ae94845..3ff0d768879a 100644 --- a/docs/source/en/api/pipelines/cycle_diffusion.mdx +++ b/docs/source/en/api/pipelines/cycle_diffusion.mdx @@ -18,6 +18,12 @@ The abstract from the paper is: *Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.* + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## CycleDiffusionPipeline [[autodoc]] CycleDiffusionPipeline - all diff --git a/docs/source/en/api/pipelines/dance_diffusion.mdx b/docs/source/en/api/pipelines/dance_diffusion.mdx index 9d8ceb5b8868..1510454d178f 100644 --- a/docs/source/en/api/pipelines/dance_diffusion.mdx +++ b/docs/source/en/api/pipelines/dance_diffusion.mdx @@ -18,6 +18,12 @@ Dance Diffusion is the first in a suite of generative audio tools for producers The original codebase of this implementation can be found at [Harmonai-org](https://github.com/Harmonai-org/sample-generator). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## DanceDiffusionPipeline [[autodoc]] DanceDiffusionPipeline - all diff --git a/docs/source/en/api/pipelines/ddim.mdx b/docs/source/en/api/pipelines/ddim.mdx index 98da201545b9..04e2f0a33bce 100644 --- a/docs/source/en/api/pipelines/ddim.mdx +++ b/docs/source/en/api/pipelines/ddim.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author at [tsong.me](https://tsong.me/). 
+ + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## DDIMPipeline [[autodoc]] DDIMPipeline - all diff --git a/docs/source/en/api/pipelines/ddpm.mdx b/docs/source/en/api/pipelines/ddpm.mdx index 1f615106bcfe..3efa603d1cae 100644 --- a/docs/source/en/api/pipelines/ddpm.mdx +++ b/docs/source/en/api/pipelines/ddpm.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [hohonathanho/diffusion](https://github.com/hojonathanho/diffusion). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + # DDPMPipeline [[autodoc]] DDPMPipeline - all diff --git a/docs/source/en/api/pipelines/dit.mdx b/docs/source/en/api/pipelines/dit.mdx index 26ca122b6c4e..8f3a8df88c4a 100644 --- a/docs/source/en/api/pipelines/dit.mdx +++ b/docs/source/en/api/pipelines/dit.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## DiTPipeline [[autodoc]] DiTPipeline - all diff --git a/docs/source/en/api/pipelines/latent_diffusion.mdx b/docs/source/en/api/pipelines/latent_diffusion.mdx index 19d5e3a5f41b..e0398dbe0468 100644 --- a/docs/source/en/api/pipelines/latent_diffusion.mdx +++ b/docs/source/en/api/pipelines/latent_diffusion.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [Compvis/latent-diffusion](https://github.com/CompVis/latent-diffusion). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## LDMTextToImagePipeline [[autodoc]] LDMTextToImagePipeline - all diff --git a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx index fe54056577cd..8555d631d43c 100644 --- a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx +++ b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). 
+ + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## LDMPipeline [[autodoc]] LDMPipeline - all diff --git a/docs/source/en/api/pipelines/model_editing.mdx b/docs/source/en/api/pipelines/model_editing.mdx index e33bb43584a1..823f4f20237b 100644 --- a/docs/source/en/api/pipelines/model_editing.mdx +++ b/docs/source/en/api/pipelines/model_editing.mdx @@ -18,7 +18,13 @@ The abstract from the paper is: *Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.* -You can find additional information about model editing on the [project page](https://time-diffusion.github.io/), [paper](https://arxiv.org/abs/2303.08084), [original codebase](https://github.com/bahjat-kawar/time-diffusion), and try it out in a [demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion). +You can find additional information about model editing on the [project page](https://time-diffusion.github.io/), [original codebase](https://github.com/bahjat-kawar/time-diffusion), and try it out in a [demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion). + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + ## StableDiffusionModelEditingPipeline [[autodoc]] StableDiffusionModelEditingPipeline diff --git a/docs/source/en/api/pipelines/overview.mdx b/docs/source/en/api/pipelines/overview.mdx index a038a5753c58..2467b143d5dc 100644 --- a/docs/source/en/api/pipelines/overview.mdx +++ b/docs/source/en/api/pipelines/overview.mdx @@ -12,13 +12,13 @@ specific language governing permissions and limitations under the License. 
# Pipelines -Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible, and they can be adapted to use different scheduler or even model components. +Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different scheduler or even model components. All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. -Pipelines do not offer any training functionality. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead! +Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead! diff --git a/docs/source/en/api/pipelines/paint_by_example.mdx b/docs/source/en/api/pipelines/paint_by_example.mdx index 0b78f6803ea2..ec7172060926 100644 --- a/docs/source/en/api/pipelines/paint_by_example.mdx +++ b/docs/source/en/api/pipelines/paint_by_example.mdx @@ -20,6 +20,16 @@ The abstract from the paper is: The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example). +## Tips + +PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## PaintByExamplePipeline [[autodoc]] PaintByExamplePipeline - all diff --git a/docs/source/en/api/pipelines/panorama.mdx b/docs/source/en/api/pipelines/panorama.mdx index f9c797227a18..a0ad0d326188 100644 --- a/docs/source/en/api/pipelines/panorama.mdx +++ b/docs/source/en/api/pipelines/panorama.mdx @@ -18,30 +18,34 @@ The abstract from the paper is: *Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. 
In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.* -You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [paper](https://arxiv.org/abs/2302.08113), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion). +You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion). ## Tips While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. For some GPUs with high performance, this can speedup the generation process and increase VRAM usage. - +To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default. Circular padding is applied to ensure there are no stitching artifacts when working with -panoramas that needs to seamlessly transition from the rightmost part to the leftmost part. +panoramas to ensure a seamless transition from the rightmost part to the leftmost part. By enabling circular padding (set `circular_padding=True`), the operation applies additional crops after the rightmost point of the image, allowing the model to "see” the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper “panorama” that can be viewed using 360-degree -panorama viewers. When decoding latents in StableDiffusion, circular padding is applied +panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space. -Without circular padding, there is a stitching artifact (default): +For example, without circular padding, there is a stitching artifact (default): ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png) -With circular padding, the right and the left parts are matching (`circular_padding=True`): +But with circular padding, the right and the left parts are matching (`circular_padding=True`): ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png) + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
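Putting the `width`, `view_batch_size`, and `circular_padding` tips together, a minimal [`StableDiffusionPanoramaPipeline`] sketch (the checkpoint and scheduler choice below are assumptions for illustration):

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

# Assumption: illustrative checkpoint; the pipeline is commonly paired with a DDIM scheduler.
model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of the dolomites",
    width=2048,             # recommended width for panorama-like outputs
    view_batch_size=4,      # process several views per forward pass on a fast GPU
    circular_padding=True,  # avoid a seam between the rightmost and leftmost parts
).images[0]
image.save("panorama.png")
```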
+ ## StableDiffusionPanoramaPipeline diff --git a/docs/source/en/api/pipelines/paradigms.mdx b/docs/source/en/api/pipelines/paradigms.mdx index 62504adb17aa..a56c02e70af3 100644 --- a/docs/source/en/api/pipelines/paradigms.mdx +++ b/docs/source/en/api/pipelines/paradigms.mdx @@ -39,6 +39,12 @@ by setting `parallel=80` and `tolerance=0.1`. 🤗 Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs. + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## StableDiffusionParadigmsPipeline [[autodoc]] StableDiffusionParadigmsPipeline - __call__ diff --git a/docs/source/en/api/pipelines/pix2pix.mdx b/docs/source/en/api/pipelines/pix2pix.mdx index d825ab4a6ed8..08990048e80b 100644 --- a/docs/source/en/api/pipelines/pix2pix.mdx +++ b/docs/source/en/api/pipelines/pix2pix.mdx @@ -18,7 +18,13 @@ The abstract from the paper is: *We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.* -You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [paper](https://huggingface.co/papers/2211.09800), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). +You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
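A minimal usage sketch for [`StableDiffusionInstructPix2PixPipeline`]; the checkpoint, example image URL, and parameter values are assumptions for illustration:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Assumption: illustrative checkpoint and example image; any RGB image works.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image(
    "https://raw.githubusercontent.com/timothybrooks/instruct-pix2pix/main/imgs/example.jpg"
)

edited = pipe(
    "make it look like a watercolor painting",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how closely the edit should stick to the input image
).images[0]
edited.save("edited.png")
```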
+ + ## StableDiffusionInstructPix2PixPipeline [[autodoc]] StableDiffusionInstructPix2PixPipeline diff --git a/docs/source/en/api/pipelines/pix2pix_zero.mdx b/docs/source/en/api/pipelines/pix2pix_zero.mdx index 2502d4d57209..9d43667c068b 100644 --- a/docs/source/en/api/pipelines/pix2pix_zero.mdx +++ b/docs/source/en/api/pipelines/pix2pix_zero.mdx @@ -18,7 +18,7 @@ The abstract from the paper is: *Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.* -You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/), [paper](https://arxiv.org/abs/2302.03027), [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo). +You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/), [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo). ## Tips diff --git a/docs/source/en/api/pipelines/pndm.mdx b/docs/source/en/api/pipelines/pndm.mdx index f4f6bd311278..0cb4799b3c81 100644 --- a/docs/source/en/api/pipelines/pndm.mdx +++ b/docs/source/en/api/pipelines/pndm.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [luping-liu/PNDM](https://github.com/luping-liu/PNDM). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## PNDMPipeline [[autodoc]] PNDMPipeline - all diff --git a/docs/source/en/api/pipelines/repaint.mdx b/docs/source/en/api/pipelines/repaint.mdx index 72b4a32e116c..9529893c354b 100644 --- a/docs/source/en/api/pipelines/repaint.mdx +++ b/docs/source/en/api/pipelines/repaint.mdx @@ -21,6 +21,13 @@ RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at l The original codebase can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint). 
+ + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + + ## RePaintPipeline [[autodoc]] RePaintPipeline - all diff --git a/docs/source/en/api/pipelines/score_sde_ve.mdx b/docs/source/en/api/pipelines/score_sde_ve.mdx index 29332b1b663c..4d95e6ec9e4a 100644 --- a/docs/source/en/api/pipelines/score_sde_ve.mdx +++ b/docs/source/en/api/pipelines/score_sde_ve.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [yang-song/score_sde_pytorch](https://github.com/yang-song/score_sde_pytorch). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## ScoreSdeVePipeline [[autodoc]] ScoreSdeVePipeline - all diff --git a/docs/source/en/api/pipelines/self_attention_guidance.mdx b/docs/source/en/api/pipelines/self_attention_guidance.mdx index 30718e69e9fc..854505f18202 100644 --- a/docs/source/en/api/pipelines/self_attention_guidance.mdx +++ b/docs/source/en/api/pipelines/self_attention_guidance.mdx @@ -18,7 +18,13 @@ The abstract from the paper is: *Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.* -You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [paper](https://arxiv.org/abs/2210.00939), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). 
+You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + ## StableDiffusionSAGPipeline [[autodoc]] StableDiffusionSAGPipeline diff --git a/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx b/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx index 3b0898b72b48..1435df551235 100644 --- a/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx +++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx @@ -19,6 +19,12 @@ The abstract from the paper is: *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## SemanticStableDiffusionPipeline [[autodoc]] SemanticStableDiffusionPipeline - all diff --git a/docs/source/en/api/pipelines/shap_e.mdx b/docs/source/en/api/pipelines/shap_e.mdx index bb971ac869e6..39f6416b18be 100644 --- a/docs/source/en/api/pipelines/shap_e.mdx +++ b/docs/source/en/api/pipelines/shap_e.mdx @@ -17,6 +17,12 @@ The abstract from the paper is: The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## Usage Examples In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format. 
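Along those lines, a minimal text-to-3D sketch with [`ShapEPipeline`]; the checkpoint and parameter values below are assumptions for illustration:

```python
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

# Assumption: illustrative checkpoint and sampling parameters.
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16).to("cuda")

images = pipe(
    "a shark",
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=256,  # resolution of each rendered view of the 3D object
).images

# The output is a list of rendered views per prompt; save them as a rotating gif.
export_to_gif(images[0], "shark_3d.gif")
```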
diff --git a/docs/source/en/api/pipelines/spectrogram_diffusion.mdx b/docs/source/en/api/pipelines/spectrogram_diffusion.mdx index bf662585e099..70c64ca5c904 100644 --- a/docs/source/en/api/pipelines/spectrogram_diffusion.mdx +++ b/docs/source/en/api/pipelines/spectrogram_diffusion.mdx @@ -22,6 +22,12 @@ The original codebase can be found at [magenta/music-spectrogram-diffusion](http As depicted above the model takes as input a MIDI file and tokenizes it into a sequence of 5 second intervals. Each tokenized interval then together with positional encodings is passed through the Note Encoder and its representation is concatenated with the previous window's generated spectrogram representation obtained via the Context Encoder. For the initial 5 second window this is set to zero. The resulting context is then used as conditioning to sample the denoised Spectrogram from the MIDI window and we concatenate this spectrogram to the final output as well as use it for the context of the next MIDI window. The process repeats till we have gone over all the MIDI inputs. Finally a MelGAN decoder converts the potentially long spectrogram to audio which is the final result of this pipeline. + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## SpectrogramDiffusionPipeline [[autodoc]] SpectrogramDiffusionPipeline - all diff --git a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx index 768d13b5df6e..09814f387b72 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx @@ -10,11 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Depth-to-Image +# Depth-to-image The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. -The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) and additional official checkpoints for depth-to-image can be found at [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth). + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
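A minimal depth-to-image sketch with [`StableDiffusionDepth2ImgPipeline`], using the `stabilityai/stable-diffusion-2-depth` checkpoint mentioned above; the example image URL and prompt are assumptions for illustration:

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

# Assumption: illustrative example image (any RGB image works).
init_image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")

# When `depth_map` is not passed, the pipeline estimates it with MiDaS internally.
image = pipe(
    prompt="two tigers",
    image=init_image,
    negative_prompt="bad, deformed, ugly",
    strength=0.7,
).images[0]
image.save("tigers.png")
```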
+ + ## StableDiffusionDepth2ImgPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx index 38e02fa6652e..4895ababf5bd 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx @@ -10,11 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Image Variation +# Image variation -The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model from [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/). +The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/). -The original codebase can be found at [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). +The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). + + + +Make sure to check out the Stable Diffusion [Tips](./overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + ## StableDiffusionImageVariationPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx index 85c167671204..b3de84c0f4eb 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx @@ -10,16 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Image-to-Image +# Image-to-image -The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. The original codebase can be found at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion/blob/main/scripts/img2img.py). +The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. -The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon). 
+The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon. The abstract from the paper is: *Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.* + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + + ## StableDiffusionImg2ImgPipeline [[autodoc]] StableDiffusionImg2ImgPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx index b621e015d93f..dc935d0bd17b 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx @@ -10,26 +10,22 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Inpaint +# Inpainting The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. -You can find the original codebases for the inpainting models in the following repositories: - -| Stable Diffusion version | Repository | -|--------------------------|------------------------------------------------------------------------------------------------------------------------| -| v1 | [CompVis/stable-diffusion](https://github.com/runwayml/stable-diffusion#inpainting-with-stable-diffusion) | -| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-inpainting-with-stable-diffusion) | - -Additional official checkpoints for different versions of the Stable Diffusion model for inpainting can be found on the [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. 
- - +## Tips It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default text-to-image Stable Diffusion checkpoints, such as -[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible with -this pipeline but might be less performant. +[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible but they might be less performant. + + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx index 9f91eb97ce5e..0775485e68db 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx @@ -10,11 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Latent Upscaler +# Latent upscaler -The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2. +The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2 (see this demo [notebook](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) for a demonstration of the original implementation). -The [Stable Diffusion Upscaler Demo](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) demonstrates the original implementation. + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + + ## StableDiffusionLatentUpscalePipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx index 6c9b2a9028ec..141867d28922 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. 
--> -# LDM3D +# Text-to-(RGB, depth) LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./stable_diffusion/overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. @@ -18,6 +18,12 @@ The abstract from the paper is: *This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + + ## StableDiffusionLDM3DPipeline [[autodoc]] StableDiffusionLDM3DPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/overview.mdx b/docs/source/en/api/pipelines/stable_diffusion/overview.mdx index 02f42c307d6f..82b2597a7043 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/overview.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/overview.mdx @@ -12,60 +12,169 @@ specific language governing permissions and limitations under the License. # Stable Diffusion pipelines -Stable Diffusion is a text-to-image _latent diffusion_ model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. - -Stable Diffusion is trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. 
- -For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and [our own blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. - -You can find the original codebases for the different Stable Diffusion versions in the following repositories: - -| Stable Diffusion version | Repository | -|--------------------------|---------------------------------------------------------------------------------| -| v1 | [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) | -| v2 | [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) | - -Additional official checkpoints for different versions of the Stable Diffusion model for different tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) organizations on the Hub. Feel free to explore these organizations to find the best pipeline for your use-case! +Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. + +Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. + +For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. + +You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case! + +The table below summarizes the available Stable Diffusion pipelines, their supported tasks, and an interactive demo: + +
+| Pipeline | Supported tasks | Space |
+|---|---|:---:|
+| StableDiffusion | text-to-image | |
+| StableDiffusionImg2Img | image-to-image | |
+| StableDiffusionInpaint | inpainting | |
+| StableDiffusionDepth2Img | depth-to-image | |
+| StableDiffusionImageVariation | image variation | |
+| StableDiffusionPipelineSafe | filtered text-to-image | |
+| StableDiffusion2 | text-to-image, inpainting, depth-to-image, super-resolution | |
+| StableDiffusionXL | text-to-image, image-to-image | |
+| StableDiffusionLatentUpscale | super-resolution | |
+| StableDiffusionUpscale | super-resolution | |
+| StableDiffusionLDM3D | text-to-rgb, text-to-depth | |
+
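+Each pipeline in the table is loaded and called in the same basic way. As a minimal sketch (assuming a CUDA device and reusing the `CompVis/stable-diffusion-v1-4` checkpoint shown in the Tips below), text-to-image generation looks roughly like this:
+
+```py
+import torch
+from diffusers import StableDiffusionPipeline
+
+# load the text-to-image checkpoint; the other pipelines in the table are loaded the same way with `from_pretrained`
+pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+pipeline = pipeline.to("cuda")
+
+# generate an image from a text prompt and save it
+image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
+image.save("astronaut.png")
+```
+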
## Tips -[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible with the [`StableDiffusionPipeline`]. To try out a different scheduler: - - +To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips are applicable to all Stable Diffusion pipelines. -Check out the [Schedulers](../using-diffusers/schedulers) guide for more details about how to change and compare different schedulers. +### Explore tradeoff between speed and quality - +[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible. For example, if you want to use the [`EulerDiscreteScheduler`] instead of the default: -```python ->>> from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler +```py +from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler ->>> pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) +pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") +pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) ->>> # or ->>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") ->>> pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler) +# or +euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") +pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler) ``` -To save memory and use the same components across multiple pipelines, use the `.components` method: - - - -Read the reuse components across pipelines [section](../using-diffusers/loading#reuse-components-across-pipelines) for more details. +### Reuse pipeline components to save memory - +To save memory and use the same components across multiple pipelines, use the `.components` method to avoid loading weights into RAM more than once. -```python ->>> from diffusers import ( -... StableDiffusionPipeline, -... StableDiffusionImg2ImgPipeline, -... StableDiffusionInpaintPipeline, -... ) +```py +from diffusers import ( + StableDiffusionPipeline, + StableDiffusionImg2ImgPipeline, + StableDiffusionInpaintPipeline, +) ->>> text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") ->>> img2img = StableDiffusionImg2ImgPipeline(**text2img.components) ->>> inpaint = StableDiffusionInpaintPipeline(**text2img.components) +text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") +img2img = StableDiffusionImg2ImgPipeline(**text2img.components) +inpaint = StableDiffusionInpaintPipeline(**text2img.components) ->>> # now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline +# now you can use text2img(...), img2img(...), inpaint(...) 
just like the call methods of each respective pipeline ``` \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx index 51fe19eeebc6..d44e9f507830 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx @@ -19,13 +19,9 @@ These models are trained on an aesthetic subset of the [LAION-5B dataset](https: For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release). - - -The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./stable_diffusion/text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler. +The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler. - - -Stable Diffusion 2 is available for a tasks like inpainting, super-resolution and depth-to-image: +Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image: | Task | Repository | |-------------------------|---------------------------------------------------------------------------------------------------------------| @@ -33,4 +29,111 @@ Stable Diffusion 2 is available for a tasks like inpainting, super-resolution an | text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) | | inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) | | super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | -| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | \ No newline at end of file +| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | + +Here are some examples for how to use Stable Diffusion 2 for each task: + + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
+ + + +## Text-to-image + +```py +from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler +import torch + +repo_id = "stabilityai/stable-diffusion-2-base" +pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") + +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) +pipe = pipe.to("cuda") + +prompt = "High quality photo of an astronaut riding a horse in space" +image = pipe(prompt, num_inference_steps=25).images[0] +image.save("astronaut.png") +``` + +## Inpainting + +```py +import PIL +import requests +import torch +from io import BytesIO + +from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler + + +def download_image(url): + response = requests.get(url) + return PIL.Image.open(BytesIO(response.content)).convert("RGB") + + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = download_image(img_url).resize((512, 512)) +mask_image = download_image(mask_url).resize((512, 512)) + +repo_id = "stabilityai/stable-diffusion-2-inpainting" +pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") + +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) +pipe = pipe.to("cuda") + +prompt = "Face of a yellow cat, high resolution, sitting on a park bench" +image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0] + +image.save("yellow_cat.png") +``` + +## Super-resolution + +```py +import requests +from PIL import Image +from io import BytesIO +from diffusers import StableDiffusionUpscalePipeline +import torch + +# load model and scheduler +model_id = "stabilityai/stable-diffusion-x4-upscaler" +pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16) +pipeline = pipeline.to("cuda") + +# let's download an image +url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png" +response = requests.get(url) +low_res_img = Image.open(BytesIO(response.content)).convert("RGB") +low_res_img = low_res_img.resize((128, 128)) +prompt = "a white cat" +upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0] +upscaled_image.save("upsampled_cat.png") +``` + +## Depth-to-image + +```py +import torch +import requests +from PIL import Image + +from diffusers import StableDiffusionDepth2ImgPipeline + +pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-depth", + torch_dtype=torch.float16, +).to("cuda") + + +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +init_image = Image.open(requests.get(url, stream=True).raw) +prompt = "two tigers" +n_propmt = "bad, deformed, ugly, bad anotomy" +image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0] +``` \ No newline at end of file diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx index 0800e6115810..217434c6b669 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx @@ -18,6 +18,8 @@ The abstract from the paper is: *Text-conditioned 
image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.* +## Tips + Use the `safety_concept` property of [`StableDiffusionPipelineSafe`] to check and edit the current safety concept: ```python @@ -40,6 +42,12 @@ There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyC >>> out = pipeline(prompt=prompt, **SafetyConfig.MAX) ``` + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + + ## StableDiffusionPipelineSafe [[autodoc]] StableDiffusionPipelineSafe diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx index 37e7e30c4857..8d09602d8605 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx @@ -10,13 +10,21 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Text-to-Image +# Text-to-image The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. The abstract from the paper is: -*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. 
To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .* +*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.* + + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
+ + ## StableDiffusionPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx index 82038fe36195..0bad9be0dcd4 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx +++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx @@ -10,11 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Super Resolution +# Super-resolution The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4. -The original codebase can be found at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-upscaling-with-stable-diffusion) and additional official checkpoints for super resolution can be found at [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler). + + +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + + ## StableDiffusionUpscalePipeline diff --git a/docs/source/en/api/pipelines/stochastic_karras_ve.mdx b/docs/source/en/api/pipelines/stochastic_karras_ve.mdx index 40eb5bb51733..6dee2d382e3b 100644 --- a/docs/source/en/api/pipelines/stochastic_karras_ve.mdx +++ b/docs/source/en/api/pipelines/stochastic_karras_ve.mdx @@ -18,6 +18,12 @@ The abstract from the paper: *We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.* + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
+ + + ## KarrasVePipeline [[autodoc]] KarrasVePipeline - all diff --git a/docs/source/en/api/pipelines/unclip.mdx b/docs/source/en/api/pipelines/unclip.mdx index 8d536d768447..8e6977b01fdf 100644 --- a/docs/source/en/api/pipelines/unclip.mdx +++ b/docs/source/en/api/pipelines/unclip.mdx @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # UnCLIP -[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)). +[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The UnCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)). The abstract from the paper is following: @@ -17,6 +17,12 @@ The abstract from the paper is following: You can find lucidrains DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## UnCLIPPipeline [[autodoc]] UnCLIPPipeline - all diff --git a/docs/source/en/api/pipelines/versatile_diffusion.mdx b/docs/source/en/api/pipelines/versatile_diffusion.mdx index 51bf953e9de7..721e7b0246dc 100644 --- a/docs/source/en/api/pipelines/versatile_diffusion.mdx +++ b/docs/source/en/api/pipelines/versatile_diffusion.mdx @@ -22,12 +22,18 @@ The abstract from the paper is: You can load the more memory intensive "all-in-one" [`VersatileDiffusionPipeline`] that supports all the tasks or use the individual pipelines which are more memory efficient. -| **Pipeline** | **Supported tasks** | -|--------------------------------------------|-----------------------------------| -| `VersatileDiffusion` | all of the below | -| `VersatileDiffusionTextToImagePipeline` | text-to-image | -| `VersatileDiffusionImageVariationPipeline` | image variation | -| `VersatileDiffusionDualGuidedPipeline` | image-text dual guided generation | +| **Pipeline** | **Supported tasks** | +|------------------------------------------------------|-----------------------------------| +| [`VersatileDiffusionPipeline`] | all of the below | +| [`VersatileDiffusionTextToImagePipeline`] | text-to-image | +| [`VersatileDiffusionImageVariationPipeline`] | image variation | +| [`VersatileDiffusionDualGuidedPipeline`] | image-text dual guided generation | + + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
+ + ## VersatileDiffusionPipeline [[autodoc]] VersatileDiffusionPipeline diff --git a/docs/source/en/api/pipelines/vq_diffusion.mdx b/docs/source/en/api/pipelines/vq_diffusion.mdx index 66614c5b177b..5441d1d579ff 100644 --- a/docs/source/en/api/pipelines/vq_diffusion.mdx +++ b/docs/source/en/api/pipelines/vq_diffusion.mdx @@ -20,6 +20,12 @@ The abstract from the paper is: The original codebase can be found at [microsoft/VQ-Diffusion](https://github.com/microsoft/VQ-Diffusion). + + +Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## VQDiffusionPipeline [[autodoc]] VQDiffusionPipeline - all diff --git a/src/diffusers/loaders.py b/src/diffusers/loaders.py index be40ae586dae..a46909a609fa 100644 --- a/src/diffusers/loaders.py +++ b/src/diffusers/loaders.py @@ -816,7 +816,8 @@ class LoraLoaderMixin: def load_lora_weights(self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], **kwargs): """ - Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into self.unet and self.text_encoder. + Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and + `self.text_encoder`. All kwargs are forwarded to `self.lora_state_dict`. @@ -831,8 +832,7 @@ def load_lora_weights(self, pretrained_model_name_or_path_or_dict: Union[str, Di Parameters: pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): See [`~loaders.LoraLoaderMixin.lora_state_dict`]. - - kwargs: + kwargs (`dict`, *optional*): See [`~loaders.LoraLoaderMixin.lora_state_dict`]. """ state_dict, network_alpha = self.lora_state_dict(pretrained_model_name_or_path_or_dict, **kwargs) @@ -1171,10 +1171,10 @@ def save_lora_weights( save_directory (`str` or `os.PathLike`): Directory to save LoRA parameters to. Will be created if it doesn't exist. unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): - State dict of the LoRA layers corresponding to the UNet. - text_encoder_lora_layers (`Dict[str, torch.nn.Module] or `Dict[str, torch.Tensor]`): + State dict of the LoRA layers corresponding to the `unet`. + text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text - encoder LoRA state dict because it comes 🤗 Transformers. + encoder LoRA state dict because it comes from 🤗 Transformers. is_main_process (`bool`, *optional*, defaults to `True`): Whether the process calling this is the main process or not. Useful during distributed training and you need to call this function on all processes. In this case, set `is_main_process=True` only on the main @@ -1353,7 +1353,7 @@ def from_single_file(cls, pretrained_model_link_or_path, **kwargs): A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to True, the model + Whether to only load local model weights and configuration files or not. If set to `True`, the model won't be downloaded from the Hub. 
use_auth_token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from @@ -1367,7 +1367,7 @@ def from_single_file(cls, pretrained_model_link_or_path, **kwargs): weights. If set to `False`, safetensors weights are not loaded. extract_ema (`bool`, *optional*, defaults to `False`): Whether to extract the EMA weights or not. Pass `True` to extract the EMA weights which usually yield - higher quality images for inference. Non-EMA weights are usually better to continue finetuning. + higher quality images for inference. Non-EMA weights are usually better for continuing finetuning. upcast_attention (`bool`, *optional*, defaults to `None`): Whether the attention computation should always be upcasted. image_size (`int`, *optional*, defaults to 512): @@ -1377,23 +1377,19 @@ def from_single_file(cls, pretrained_model_link_or_path, **kwargs): The prediction type the model was trained on. Use `'epsilon'` for all Stable Diffusion v1 models and the Stable Diffusion v2 base model. Use `'v_prediction'` for Stable Diffusion v2. num_in_channels (`int`, *optional*, defaults to `None`): - The number of input channels. If `None`, it will be automatically inferred. + The number of input channels. If `None`, it is automatically inferred. scheduler_type (`str`, *optional*, defaults to `"pndm"`): Type of scheduler to use. Should be one of `["pndm", "lms", "heun", "euler", "euler-ancestral", "dpm", "ddim"]`. load_safety_checker (`bool`, *optional*, defaults to `True`): Whether to load the safety checker or not. - text_encoder (`CLIPTextModel`, *optional*, defaults to `None`): - An instance of - [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel) to use, - specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) - variant. If this parameter is `None`, the function will load a new instance of [CLIP] by itself, if - needed. - tokenizer (`CLIPTokenizer`, *optional*, defaults to `None`): - An instance of - [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer) - to use. If this parameter is `None`, the function will load a new instance of [CLIPTokenizer] by - itself, if needed. + text_encoder ([`~transformers.CLIPTextModel`], *optional*, defaults to `None`): + An instance of `CLIPTextModel` to use, specifically the + [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. If this + parameter is `None`, the function loads a new instance of `CLIPTextModel` by itself if needed. + tokenizer ([`~transformers.CLIPTokenizer`], *optional*, defaults to `None`): + An instance of `CLIPTokenizer` to use. If this parameter is `None`, the function loads a new instance + of `CLIPTokenizer` by itself if needed. kwargs (remaining dictionary of keyword arguments, *optional*): Can be used to overwrite load and saveable variables (for example the pipeline components of the specific pipeline class). 
The overwritten components are directly passed to the pipelines `__init__` diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py index 480596bda5b3..e8d59a582cd7 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py @@ -74,29 +74,30 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`RobertaSeriesModelWithTransformation`]): + text_encoder ([`~transformers.RobertaSeriesModelWithTransformation`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`XLMRobertaTokenizer`): - A [`~transformers.XLMRobertaTokenizer`] to tokenize text. + tokenizer ([`~transformers.XLMRobertaTokenizer`]): + A `XLMRobertaTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] diff --git a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py index 8eb4a68ee99c..1c6aa3d74465 100644 --- a/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py +++ b/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py @@ -104,29 +104,30 @@ class AltDiffusionImg2ImgPipeline( This model inherits from [`DiffusionPipeline`]. 
Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`RobertaSeriesModelWithTransformation`]): + text_encoder ([`~transformers.RobertaSeriesModelWithTransformation`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`XLMRobertaTokenizer`): - A [`~transformers.XLMRobertaTokenizer`] to tokenize text. + tokenizer ([`~transformers.XLMRobertaTokenizer`]): + A `XLMRobertaTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -590,7 +591,7 @@ def __call__( The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): `Image` or tensor representing an image batch to be used as the starting point. Can also accept image - latents as `image`, if passing latents directly, it will not be encoded again. + latents as `image`, but if passing latents directly it is not encoded again. strength (`float`, *optional*, defaults to 0.8): Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a starting point and more noise is added the higher the `strength`. 
The number of denoising steps depends diff --git a/src/diffusers/pipelines/audio_diffusion/mel.py b/src/diffusers/pipelines/audio_diffusion/mel.py index 4bf19ee13215..38a11cdaab7d 100644 --- a/src/diffusers/pipelines/audio_diffusion/mel.py +++ b/src/diffusers/pipelines/audio_diffusion/mel.py @@ -127,12 +127,12 @@ def get_audio_slice(self, slice: int = 0) -> np.ndarray: Returns: `np.ndarray`: - The audio slice as a NumPy array + The audio slice as a NumPy array. """ return self.audio[self.slice_size * slice : self.slice_size * (slice + 1)] def get_sample_rate(self) -> int: - """Get sample rate: + """Get sample rate. Returns: `int`: diff --git a/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py b/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py index 107e02a34ecb..74737560cd8e 100644 --- a/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py +++ b/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py @@ -38,12 +38,12 @@ class AudioDiffusionPipeline(DiffusionPipeline): vqae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. mel ([`Mel`]): Transform audio into a spectrogram. - scheduler ([`DDIMScheduler` or `DDPMScheduler`]): + scheduler ([`DDIMScheduler`] or [`DDPMScheduler`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - [`DDIMScheduler` or `DDPMScheduler`]. + [`DDIMScheduler`] or [`DDPMScheduler`]. """ _optional_components = ["vqvae"] @@ -59,10 +59,11 @@ def __init__( self.register_modules(unet=unet, scheduler=scheduler, mel=mel, vqvae=vqvae) def get_default_steps(self) -> int: - """Returns default number of steps recommended for inference + """Returns default number of steps recommended for inference. Returns: - `int`: number of steps + `int`: + The number of steps. """ return 50 if isinstance(self.scheduler, DDIMScheduler) else 1000 @@ -119,9 +120,9 @@ def __call__( noise (`torch.Tensor`): A noise tensor of shape `(batch_size, 1, height, width)` or `None`. encoding (`torch.Tensor`): - for UNet2DConditionModel shape (batch_size, seq_length, cross_attention_dim) + A tensor for [`UNet2DConditionModel`] of shape `(batch_size, seq_length, cross_attention_dim)`. return_dict (`bool`): - Whether or not to return a [`AudioPipelineOutput`], [`ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`AudioPipelineOutput`], [`ImagePipelineOutput`] or a plain tuple. Examples: @@ -275,7 +276,7 @@ def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray: images (`List[PIL Image]`): List of images to encode. steps (`int`): - Number of encoding steps to perform (defaults to `50`) + Number of encoding steps to perform (defaults to `50`). Returns: `np.ndarray`: @@ -309,7 +310,7 @@ def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray: @staticmethod def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor: - """Spherical Linear intERPolation + """Spherical Linear intERPolation. 
Args: x0 (`torch.Tensor`): diff --git a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py index 1286ef5c14f0..1816972e3bb1 100644 --- a/src/diffusers/pipelines/audioldm/pipeline_audioldm.py +++ b/src/diffusers/pipelines/audioldm/pipeline_audioldm.py @@ -58,18 +58,18 @@ class AudioLDMPipeline(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`ClapTextModelWithProjection`]): - Frozen text-encoder ([`~transformers.ClapTextModelWithProjection`], specifically the + text_encoder ([`~transformers.ClapTextModelWithProjection`]): + Frozen text-encoder (`ClapTextModelWithProjection`, specifically the [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant. tokenizer ([`PreTrainedTokenizer`]): A [`~transformers.RobertaTokenizer`] to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded audio latents. + A `UNet2DConditionModel` to denoise the encoded audio latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. - vocoder ([`SpeechT5HifiGan`]): - Vocoder of class [`~transformers.SpeechT5HifiGan`]. + vocoder ([`~transformers.SpeechT5HifiGan`]): + Vocoder of class `SpeechT5HifiGan`. """ def __init__( diff --git a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py index 55772a83ca78..7fe1d12a251f 100644 --- a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py +++ b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py @@ -57,7 +57,7 @@ class ConsistencyModelPipeline(DiffusionPipeline): Args: unet ([`UNet2DModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Currently only compatible with [`CMStochasticIterativeScheduler`]. diff --git a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py index d4f3887b6035..f7d2ea4fba62 100644 --- a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py +++ b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py @@ -33,7 +33,7 @@ class DanceDiffusionPipeline(DiffusionPipeline): Parameters: unet ([`UNet1DModel`]): - A [`UNet1DModel`] to denoise the encoded audio. + A `UNet1DModel` to denoise the encoded audio. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of [`IPNDMScheduler`]. @@ -72,7 +72,6 @@ def __call__( Example: ```py - # !pip install diffusers[torch] accelerate scipy from diffusers import DiffusionPipeline from scipy.io.wavfile import write diff --git a/src/diffusers/pipelines/ddim/pipeline_ddim.py b/src/diffusers/pipelines/ddim/pipeline_ddim.py index 06dbf1d525c5..6eae78f2801e 100644 --- a/src/diffusers/pipelines/ddim/pipeline_ddim.py +++ b/src/diffusers/pipelines/ddim/pipeline_ddim.py @@ -30,7 +30,7 @@ class DDIMPipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image latents. 
+ A `UNet2DModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of [`DDPMScheduler`], or [`DDIMScheduler`]. @@ -82,7 +82,6 @@ def __call__( Example: ```py - >>> # !pip install diffusers >>> from diffusers import DDIMPipeline >>> import PIL.Image >>> import numpy as np diff --git a/src/diffusers/pipelines/ddpm/pipeline_ddpm.py b/src/diffusers/pipelines/ddpm/pipeline_ddpm.py index ef62243501dc..1e9ead0f3d39 100644 --- a/src/diffusers/pipelines/ddpm/pipeline_ddpm.py +++ b/src/diffusers/pipelines/ddpm/pipeline_ddpm.py @@ -30,7 +30,7 @@ class DDPMPipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image latents. + A `UNet2DModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of [`DDPMScheduler`], or [`DDIMScheduler`]. @@ -58,7 +58,7 @@ def __call__( generator (`torch.Generator`, *optional*): A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic. - num_inference_steps (`int`, *optional*, defaults to 50): + num_inference_steps (`int`, *optional*, defaults to 1000): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. output_type (`str`, *optional*, defaults to `"pil"`): @@ -69,7 +69,6 @@ def __call__( Example: ```py - >>> # !pip install diffusers >>> from diffusers import DDPMPipeline >>> # load model and scheduler diff --git a/src/diffusers/pipelines/dit/pipeline_dit.py b/src/diffusers/pipelines/dit/pipeline_dit.py index 5efd86d88aca..d57f13c2991a 100644 --- a/src/diffusers/pipelines/dit/pipeline_dit.py +++ b/src/diffusers/pipelines/dit/pipeline_dit.py @@ -37,7 +37,7 @@ class DiTPipeline(DiffusionPipeline): Parameters: transformer ([`Transformer2DModel`]): - A [`Transformer2DModel`] to denoise the encoded image latents. + A class conditioned `Transformer2DModel` to denoise the encoded image latents. vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. scheduler ([`DDIMScheduler`]): diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py index 46d7868b6936..958d7750884a 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py @@ -41,10 +41,10 @@ class LDMTextToImagePipeline(DiffusionPipeline): Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): Text-encoder model based on [`~transformers.BERT`]. - tokenizer (`transformers.BertTokenizer`): - A [`~transformers.BertTokenizer`] to tokenize text. + tokenizer ([`~transformers.BertTokenizer`]): + A `BertTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. 
@@ -108,7 +108,6 @@ def __call__( Example: ```py - >>> # !pip install diffusers transformers >>> from diffusers import DiffusionPipeline >>> # load model and scheduler diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py index 09bef71497b5..f6806af3c37e 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py @@ -31,7 +31,7 @@ def preprocess(image): class LDMSuperResolutionPipeline(DiffusionPipeline): r""" - A pipeline for image super-resolution using latent diffusion + A pipeline for image super-resolution using latent diffusion. This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). @@ -40,7 +40,7 @@ class LDMSuperResolutionPipeline(DiffusionPipeline): vqvae ([`VQModel`]): Vector-quantized (VQ) model to encode and decode images to and from latent representations. unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image. + A `UNet2DModel` to denoise the encoded image. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latens. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], @@ -99,7 +99,6 @@ def __call__( Example: ```py - >>> #!pip install git+https://github.com/huggingface/diffusers.git >>> import requests >>> from PIL import Image >>> from io import BytesIO diff --git a/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py b/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py index b58c466cfb60..be130a74c28c 100644 --- a/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py +++ b/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py @@ -34,7 +34,7 @@ class LDMPipeline(DiffusionPipeline): vqvae ([`VQModel`]): Vector-quantized (VQ) model to encode and decode images to and from latent representations. unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image latents. + A `UNet2DModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): [`DDIMScheduler`] is used in combination with `unet` to denoise the encoded image latents. """ @@ -69,12 +69,11 @@ def __call__( output_type (`str`, *optional*, defaults to `"pil"`): The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. 
Example: ```py - >>> # !pip install diffusers transformers >>> from diffusers import LDMPipeline >>> # load model and scheduler diff --git a/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py b/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py index 9962d112bd66..09b225b06581 100644 --- a/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py +++ b/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py @@ -136,7 +136,7 @@ def prepare_mask_and_masked_image(image, mask): class PaintByExamplePipeline(DiffusionPipeline): r""" - 🧪 This is an experimental feature! @@ -151,11 +151,11 @@ class PaintByExamplePipeline(DiffusionPipeline): vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. image_encoder ([`PaintByExampleImageEncoder`]): - Encodes the example input image. The UNet is conditioned on the example image instead of a text prompt. - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + Encodes the example input image. The `unet` is conditioned on the example image instead of a text prompt. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. @@ -163,8 +163,9 @@ class PaintByExamplePipeline(DiffusionPipeline): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. + """ # TODO: feature_extractor is required to encode initial images (if they are in PIL format), # we should give a descriptive message if the pipeline doesn't have one. @@ -438,8 +439,6 @@ def __call__( Example: ```py - >>> # !pip install diffusers transformers - >>> import PIL >>> import requests >>> import torch diff --git a/src/diffusers/pipelines/pipeline_flax_utils.py b/src/diffusers/pipelines/pipeline_flax_utils.py index f5e7880da1cd..21fbc36c610a 100644 --- a/src/diffusers/pipelines/pipeline_flax_utils.py +++ b/src/diffusers/pipelines/pipeline_flax_utils.py @@ -97,7 +97,7 @@ class FlaxDiffusionPipeline(ConfigMixin): [`FlaxDiffusionPipeline`] stores all components (models, schedulers, and processors) for diffusion pipelines and provides methods for loading, downloading and saving models. It also includes methods to: - - enabling/disabling the progress bar for the denoising iteration + - enable/disable the progress bar for the denoising iteration Class attributes: @@ -191,9 +191,9 @@ class implements both a save and loading method. The pipeline is easily reloaded @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], **kwargs): r""" - Instantiate a Flax diffusion pipeline from pretrained pipeline weights. 
+ Instantiate a Flax-based diffusion pipeline from pretrained pipeline weights. - The pipeline is set in evaluation mode - `model.eval()` - by default, and dropout modules are deactivated. + The pipeline is set in evaluation mode (`model.eval()) by default and dropout modules are deactivated. If you get the error message below, you need to finetune the weights for your downstream task: @@ -228,7 +228,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P won't be downloaded from the Hub. use_auth_token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in ~/.huggingface) is used. + `diffusers-cli login` (stored in `~/.huggingface`) is used. revision (`str`, *optional*, defaults to `"main"`): The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier allowed by Git. diff --git a/src/diffusers/pipelines/pipeline_utils.py b/src/diffusers/pipelines/pipeline_utils.py index 3d827596d508..30ef1a7daa41 100644 --- a/src/diffusers/pipelines/pipeline_utils.py +++ b/src/diffusers/pipelines/pipeline_utils.py @@ -463,13 +463,13 @@ class DiffusionPipeline(ConfigMixin): provides methods for loading, downloading and saving models. It also includes methods to: - move all PyTorch modules to the device of your choice - - enabling/disabling the progress bar for the denoising iteration + - enable/disable the progress bar for the denoising iteration Class attributes: - **config_name** (`str`) -- The configuration filename that stores the class and module names of all the diffusion pipeline's components. - - **_optional_components** (List[`str`]) -- List of all optional components that don't have to be passed to the + - **_optional_components** (`List[str]`) -- List of all optional components that don't have to be passed to the pipeline to function (should be overridden by subclasses). """ config_name = "model_index.json" @@ -1475,10 +1475,9 @@ def set_progress_bar_config(self, **kwargs): def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None): r""" - Enable memory efficient attention from [xFormers](https://facebookresearch.github.io/xformers/). - - When this option is enabled, you should observe lower GPU memory usage and a potential speed up during - inference. Speed up during training is not guaranteed. + Enable memory efficient attention from [xFormers](https://facebookresearch.github.io/xformers/). When this + option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed + up during training is not guaranteed. @@ -1537,10 +1536,9 @@ def fn_recursive_set_mem_eff(module: torch.nn.Module): def enable_attention_slicing(self, slice_size: Optional[Union[str, int]] = "auto"): r""" - Enable sliced attention computation. - - When this option is enabled, the attention module splits the input tensor in slices to compute attention in - several steps. This is useful to save some memory in exchange for a small speed decrease. + Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor + in slices to compute attention in several steps. This is useful to save some memory in exchange for a small + speed decrease. 
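The two memory-saving methods documented in this hunk are called directly on an instantiated pipeline. A minimal sketch, assuming a Stable Diffusion checkpoint and that the separate xformers package is installed for the second call:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# split attention into slices to lower peak memory at a small speed cost
pipe.enable_attention_slicing()

# use xFormers memory-efficient attention (requires the xformers package)
pipe.enable_xformers_memory_efficient_attention()

image = pipe("an astronaut riding a horse").images[0]
```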
Args: slice_size (`str` or `int`, *optional*, defaults to `"auto"`): diff --git a/src/diffusers/pipelines/pndm/pipeline_pndm.py b/src/diffusers/pipelines/pndm/pipeline_pndm.py index ffe7f8b3b94d..747747b6f001 100644 --- a/src/diffusers/pipelines/pndm/pipeline_pndm.py +++ b/src/diffusers/pipelines/pndm/pipeline_pndm.py @@ -32,9 +32,9 @@ class PNDMPipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image latents. + A `UNet2DModel` to denoise the encoded image latents. scheduler ([`PNDMScheduler`]): - A [`PNDMScheduler`] to be used in combination with `unet` to denoise the encoded image. + A `PNDMScheduler` to be used in combination with `unet` to denoise the encoded image. """ unet: UNet2DModel @@ -66,7 +66,7 @@ def __call__( num_inference_steps (`int`, `optional`, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - generator (`torch.Generator`, `optional`): A [torch + generator (`torch.Generator`, `optional`): A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic. output_type (`str`, `optional`, defaults to `"pil"`): @@ -77,7 +77,6 @@ def __call__( Example: ```py - >>> # !pip install diffusers >>> from diffusers import PNDMPipeline >>> # load model and scheduler diff --git a/src/diffusers/pipelines/repaint/pipeline_repaint.py b/src/diffusers/pipelines/repaint/pipeline_repaint.py index d9768ea85226..038f9280c782 100644 --- a/src/diffusers/pipelines/repaint/pipeline_repaint.py +++ b/src/diffusers/pipelines/repaint/pipeline_repaint.py @@ -85,9 +85,9 @@ class RePaintPipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image latents. + A `UNet2DModel` to denoise the encoded image latents. scheduler ([`RePaintScheduler`]): - A [`RePaintScheduler`] to be used in combination with `unet` to denoise the encoded image. + A `RePaintScheduler` to be used in combination with `unet` to denoise the encoded image. """ unet: UNet2DModel @@ -117,7 +117,7 @@ def __call__( image (`torch.FloatTensor` or `PIL.Image.Image`): The original image to inpaint on. mask_image (`torch.FloatTensor` or `PIL.Image.Image`): - The mask_image where `0.0` define which part of the original image to inpaint. + The mask_image where 0.0 define which part of the original image to inpaint. num_inference_steps (`int`, *optional*, defaults to 1000): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. @@ -126,10 +126,10 @@ def __call__( DDIM and 1.0 is the DDPM scheduler. jump_length (`int`, *optional*, defaults to 10): The number of steps taken forward in time before going backward in time for a single jump ("j" in - RePaint paper). Take a look at Figure 9 and 10 in https://arxiv.org/pdf/2201.09865.pdf. + RePaint paper). Take a look at Figure 9 and 10 in the [paper](https://arxiv.org/pdf/2201.09865.pdf). jump_n_sample (`int`, *optional*, defaults to 10): The number of times to make a forward time jump for a given chosen time sample. Take a look at Figure 9 - and 10 in https://arxiv.org/pdf/2201.09865.pdf. + and 10 in the [paper](https://arxiv.org/pdf/2201.09865.pdf). generator (`torch.Generator`, *optional*): A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic. 
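To make the `RePaintPipeline` arguments above concrete, here is a hedged usage sketch. The checkpoint name and the local image files are placeholders, and the mask follows the convention documented above, where 0.0 marks the region to repaint.

```py
import torch
from PIL import Image
from diffusers import RePaintPipeline, RePaintScheduler

# checkpoint name is illustrative; an unconditional 256x256 DDPM checkpoint is assumed
scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256")
pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler).to("cuda")

# placeholder local files: a 256x256 RGB image and a mask where 0.0 marks pixels to repaint
original_image = Image.open("face_256.png").convert("RGB")
mask_image = Image.open("mask_256.png").convert("RGB")

generator = torch.Generator(device="cuda").manual_seed(0)
output = pipe(
    image=original_image,
    mask_image=mask_image,
    num_inference_steps=250,
    eta=0.0,
    jump_length=10,
    jump_n_sample=10,
    generator=generator,
)
inpainted = output.images[0]
```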
diff --git a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py index 2c171b611581..69aec5b60a44 100644 --- a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py +++ b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py @@ -31,9 +31,9 @@ class ScoreSdeVePipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image. + A `UNet2DModel` to denoise the encoded image. scheduler ([`ScoreSdeVeScheduler`]): - A [`ScoreSdeVeScheduler`] scheduler to be used in combination with `unet` to denoise the encoded image. + A `ScoreSdeVeScheduler` to be used in combination with `unet` to denoise the encoded image. """ unet: UNet2DModel scheduler: ScoreSdeVeScheduler @@ -58,7 +58,7 @@ def __call__( Args: batch_size (`int`, *optional*, defaults to 1): The number of images to generate. - generator (`torch.Generator`, `optional`): A [torch + generator (`torch.Generator`, `optional`): A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic. output_type (`str`, `optional`, defaults to `"pil"`): @@ -66,12 +66,6 @@ def __call__( return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. - Example: - - ```py - - ``` - Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` diff --git a/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py b/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py index 7c961cc53b83..13d1ffd6aa16 100644 --- a/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py +++ b/src/diffusers/pipelines/semantic_stable_diffusion/__init__.py @@ -20,7 +20,7 @@ class SemanticStableDiffusionPipelineOutput(BaseOutput): num_channels)`. nsfw_content_detected (`List[bool]`) List indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content or - None if safety checking could not be performed. + `None` if safety checking could not be performed. """ images: Union[List[PIL.Image.Image], np.ndarray] diff --git a/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py b/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py index ec5eebad4cc0..29082beb9128 100644 --- a/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py +++ b/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py @@ -29,21 +29,21 @@ class SemanticStableDiffusionPipeline(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. 
Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`Q16SafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -237,7 +237,7 @@ def __call__( Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + The prompt or prompts to guide image generation. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): @@ -278,7 +278,7 @@ def __call__( The prompt or prompts to use for semantic guidance. Semantic guidance is disabled by setting `editing_prompt = None`. Guidance direction of prompt should be specified via `reverse_editing_direction`. - editing_prompt_embeddings (`torch.Tensor>`, *optional*): + editing_prompt_embeddings (`torch.Tensor`, *optional*): Pre-computed embeddings to use for semantic guidance. Guidance direction of embedding should be specified via `reverse_editing_direction`. reverse_editing_direction (`bool` or `List[bool]`, *optional*, defaults to `False`): diff --git a/src/diffusers/pipelines/stable_diffusion/__init__.py b/src/diffusers/pipelines/stable_diffusion/__init__.py index 1219f44dc8eb..1fddb712e6a9 100644 --- a/src/diffusers/pipelines/stable_diffusion/__init__.py +++ b/src/diffusers/pipelines/stable_diffusion/__init__.py @@ -119,11 +119,11 @@ class FlaxStableDiffusionPipelineOutput(BaseOutput): Output class for Flax-based Stable Diffusion pipelines. Args: - images (`np.ndarray`) + images (`np.ndarray`): Denoised images of array shape of `(batch_size, height, width, num_channels)`. - nsfw_content_detected (`List[bool]`) - List indicating whether the corresponding generated image contains "not-safe-for-work" - (nsfw) content or `None` if safety checking could not be performed. + nsfw_content_detected (`List[bool]`): + List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content + or `None` if safety checking could not be performed. """ images: np.ndarray diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py index 61ce824457c9..78dd15e90f1e 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_cycle_diffusion.py @@ -136,21 +136,21 @@ class CycleDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. 
- text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): - A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can only be an + instance of [`DDIMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -601,7 +601,7 @@ def __call__( The prompt or prompts to guide the image generation. image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): `Image` or tensor representing an image batch to be used as the starting point. Can also accept image - latents as `image`, if passing latents directly, it will not be encoded again. + latents as `image`, but if passing latents directly it is not encoded again. strength (`float`, *optional*, defaults to 0.8): Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a starting point and more noise is added the higher the `strength`. The number of denoising steps depends diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py index 2eea5730999d..e1688426e636 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py @@ -80,7 +80,7 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline): r""" - Pipeline for text-to-image generation using Stable Diffusion. + Flax-based pipeline for text-to-image generation using Stable Diffusion. This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). @@ -88,12 +88,12 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline): Args: vae ([`FlaxAutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`FlaxCLIPTextModel`]): + text_encoder ([`~transformers.FlaxCLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). 
- tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`FlaxUNet2DConditionModel`]): - A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. + A `FlaxUNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or @@ -102,8 +102,8 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -327,7 +327,7 @@ def __call__( Args: prompt (`str` or `List[str]`, *optional*): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + The prompt or prompts to guide image generation. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): @@ -341,7 +341,7 @@ def __call__( latents (`jnp.array`, *optional*): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor is generated by sampling using the supplied random `generator`. + array is generated by sampling using the supplied random `generator`. jit (`bool`, defaults to `False`): Whether to run `pmap` versions of the generation and safety scoring functions. diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py index db5c97b14ffd..b4c0387ca01b 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_img2img.py @@ -104,7 +104,7 @@ class FlaxStableDiffusionImg2ImgPipeline(FlaxDiffusionPipeline): r""" - Pipeline for text-guided image-to-image generation using Stable Diffusion. + Flax-based pipeline for text-guided image-to-image generation using Stable Diffusion. This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). @@ -114,19 +114,20 @@ class FlaxStableDiffusionImg2ImgPipeline(FlaxDiffusionPipeline): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`~transformers.FlaxCLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. 
+ tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`FlaxUNet2DConditionModel`]): - A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. + A `FlaxUNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], or [`FlaxPNDMScheduler`]. + [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or + [`FlaxDPMSolverMultistepScheduler`]. safety_checker ([`FlaxStableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -355,13 +356,13 @@ def __call__( Args: prompt_ids (`jnp.array`): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + The prompt or prompts to guide image generation. image (`jnp.array`): Array representing an image batch to be used as the starting point. params (`Dict` or `FrozenDict`): - Dictionary containing the model parameters/weights + Dictionary containing the model parameters/weights. prng_seed (`jax.random.KeyArray` or `jax.Array`): - Array containing random number generator key + Array containing random number generator key. strength (`float`, *optional*, defaults to 0.8): Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a starting point and more noise is added the higher the `strength`. The number of denoising steps depends @@ -371,9 +372,9 @@ def __call__( num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated by `strength`. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. guidance_scale (`float`, *optional*, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py index 3cdac8e29885..36d14423f322 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion_inpaint.py @@ -101,7 +101,7 @@ class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline): r""" - Pipeline for text-guided image inpainting using Stable Diffusion. 
+ Flax-based pipeline for text-guided image inpainting using Stable Diffusion. @@ -117,19 +117,20 @@ class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. text_encoder ([`~transformers.FlaxCLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`FlaxUNet2DConditionModel`]): - A [`FlaxUNet2DConditionModel`] to denoise the encoded image latents. + A `FlaxUNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], or [`FlaxPNDMScheduler`]. + [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or + [`FlaxDPMSolverMultistepScheduler`]. safety_checker ([`FlaxStableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ def __init__( @@ -412,10 +413,10 @@ def __call__( Args: prompt (`str` or `List[str]`): - The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + The prompt or prompts to guide image generation. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the @@ -426,7 +427,7 @@ def __call__( latents (`jnp.array`, *optional*): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents - tensor is generated by sampling using the supplied random `generator`. + array is generated by sampling using the supplied random `generator`. jit (`bool`, defaults to `False`): Whether to run `pmap` versions of the generation and safety scoring functions. 
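The Flax pipelines above (text-to-image, image-to-image, and inpainting) share the same calling pattern: tokenize prompts with `prepare_inputs`, replicate the parameters, shard the inputs, and call the pipeline with `jit=True`. A text-to-image sketch, assuming the `runwayml/stable-diffusion-v1-5` checkpoint with its `bf16` revision:

```py
import jax
import jax.numpy as jnp
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="bf16", dtype=jnp.bfloat16
)

# one prompt per device; prepare_inputs returns the tokenized prompt ids
num_devices = jax.device_count()
prompt_ids = pipeline.prepare_inputs(num_devices * ["a photo of an astronaut riding a horse on mars"])

# replicate the weights and shard the inputs and RNG across devices for pmap
params = replicate(params)
prng_seed = jax.random.split(jax.random.PRNGKey(0), num_devices)
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps=50, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_devices,) + images.shape[-3:])))
```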
diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py index 4baa3c51e8a7..1fcb88b78f14 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py @@ -76,29 +76,30 @@ class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lo This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. 
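For reference alongside the `StableDiffusionPipeline` docstring above, a minimal text-to-image call. The checkpoint is an assumption; any Stable Diffusion v1-style checkpoint works the same way.

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# the safety_checker flags potentially NSFW outputs; it stays enabled by default
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```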
""" _optional_components = ["safety_checker", "feature_extractor"] diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py index 2e4cd6bc68f4..a95015a2b850 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py @@ -164,7 +164,7 @@ def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, a class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversionLoaderMixin): r""" - Pipeline for text-to-image generation using Stable Diffusion and Attend and Excite. + Pipeline for text-to-image generation using Stable Diffusion and Attend-and-Excite. This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). @@ -172,21 +172,21 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py index 6cc46d3ec11e..701526a3c154 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py @@ -69,23 +69,23 @@ class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoader This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). 
- The pipeline also inherits the following loading and saving methods: - - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - [`~loaders.LoraLoaderMixin.load_lora_weights`] - - [`~loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. """ def __init__( diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py index efe065d0e2cc..4a1e6d50ab26 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_image_variation.py @@ -44,19 +44,23 @@ class StableDiffusionImageVariationPipeline(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - image_encoder ([`CLIPVisionModelWithProjection`]): + image_encoder ([`~transformers.CLIPVisionModelWithProjection`]): Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). + text_encoder ([`~transformers.CLIPTextModel`]): + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. 
+ feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ # TODO: feature_extractor is required to encode images (if they are in PIL format), # we should give a descriptive message if the pipeline doesn't have one. @@ -259,9 +263,9 @@ def __call__( image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`): Image or images to guide image generation. If you provide a tensor, it needs to be compatible with [`CLIPImageProcessor`](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json). - height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py index cc89824772e8..744dea6fcf01 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py @@ -107,29 +107,30 @@ class StableDiffusionImg2ImgPipeline( This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. 
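A short image-to-image sketch to complement the `StableDiffusionImg2ImgPipeline` docstring above. The checkpoint and the local starting image are placeholders; `strength` controls how strongly the starting image is transformed.

```py
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "sketch.png" is a placeholder for any RGB starting image
init_image = Image.open("sketch.png").convert("RGB").resize((768, 512))

image = pipe(
    "A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
```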
safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -596,7 +597,7 @@ def __call__( The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): `Image` or tensor representing an image batch to be used as the starting point. Can also accept image - latents as `image`, if passing latents directly, it will not be encoded again. + latents as `image`, but if passing latents directly it is not encoded again. strength (`float`, *optional*, defaults to 0.8): Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a starting point and more noise is added the higher the `strength`. The number of denoising steps depends diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py index f5624fc0e63a..f977edb4a2d9 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py @@ -161,29 +161,29 @@ class StableDiffusionInpaintPipeline( This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - The pipeline also inherits the following loading and saving methods: - - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - [`~loaders.LoraLoaderMixin.load_lora_weights`] - - [`~loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights Args: vae ([`AutoencoderKL`, `AsymmetricAutoencoderKL`]): Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. text_encoder ([`CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. 
+ [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -710,7 +710,7 @@ def __call__( The height in pixels of the generated image. width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. - strength (`float`, *optional*, defaults to 0.8): + strength (`float`, *optional*, defaults to 1.0): Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a starting point and more noise is added the higher the `strength`. The number of denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py index 6b12719c6f5b..d27f8a21f369 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py @@ -75,31 +75,29 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. 
Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -180,7 +178,7 @@ def __call__( The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): `Image` or tensor representing an image batch to be repainted according to `prompt`. Can also accept - image latents as `image`, if passing latents directly, it will not be encoded again. + image latents as `image`, but if passing latents directly it is not encoded again. num_inference_steps (`int`, *optional*, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py index 8692bbdd82dc..da1fa84b41d9 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py @@ -68,15 +68,14 @@ class StableDiffusionLatentUpscalePipeline(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): - A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `EulerDiscreteScheduler`]. + A [`EulerDiscreteScheduler`] to be used in combination with `unet` to denoise the encoded image latents. """ def __init__( diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py index a65ce0829048..88949c5f5e8c 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py @@ -85,32 +85,30 @@ class StableDiffusionLDM3DPipeline( This model inherits from [`DiffusionPipeline`]. 
Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -506,7 +504,7 @@ def __call__( num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - guidance_scale (`float`, *optional*, defaults to 7.5): + guidance_scale (`float`, *optional*, defaults to 5.0): A higher guidance scale value encourages the model to generate images closely linked to the text `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. 
negative_prompt (`str` or `List[str]`, *optional*): diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py index 46fa336d88db..3ff3b1f2329e 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py @@ -44,21 +44,22 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa Args: vae ([`AutoencoderKL`]): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): - A scheduler to be used in combination with `unet` to denoise the encoded image latents. + A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPFeatureExtractor`]): + A `CLIPFeatureExtractor` to extract features from generated images; used as inputs to the `safety_checker`. with_to_k ([`bool`]): Whether to edit the key projection matrices along with the value projection matrices. with_augs ([`list`]): diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py index 69dcaaff4bad..f1f2e2d607db 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py @@ -58,31 +58,24 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). - - - To generate panorama-like images make sure you pass the `width` parameter accordingly. We recommend a `width` value - of 2048 which is the default. - - - Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). 
-            tokenizer (`CLIPTokenizer`): -            A [`~transformers.CLIPTokenizer`] to tokenize text. +        tokenizer ([`~transformers.CLIPTokenizer`]): +            A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): -            A [`UNet2DConditionModel`] to denoise the encoded image latents. +            A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of -            `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. +            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. -        feature_extractor ([`CLIPImageProcessor`]): -            A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. +        feature_extractor ([`~transformers.CLIPImageProcessor`]): +            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -487,8 +480,8 @@ def __call__( A higher guidance scale value encourages the model to generate images closely linked to the text `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. view_batch_size (`int`, *optional*, defaults to 1): -            The batch size to denoise splited views. For some GPUs with high performance, higher view batch size -            can speedup the generation and increase the VRAM usage. +            The batch size to denoise split views. For some GPUs with high performance, higher view batch size can +            speed up the generation and increase the VRAM usage. negative_prompt (`str` or `List[str]`, *optional*): The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). @@ -526,7 +519,7 @@ def __call__( `self.processor` in [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). circular_padding (`bool`, *optional*, defaults to `False`): -            If set to True, circular padding is applied to ensure there are no stitching artifacts. Circular +            If set to `True`, circular padding is applied to ensure there are no stitching artifacts. Circular padding allows the model to seamlessly generate a transition from the rightmost part of the image to the leftmost part, maintaining consistency in a 360-degree sense. diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py index ce4793743bc1..45f1c78f8428 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py @@ -68,32 +68,30 @@ class StableDiffusionParadigmsPipeline( This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
- In addition the pipeline inherits the following loading methods: - - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - - *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - - *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - - as well as the following saving methods: - - *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`] + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py index d26057fb7a1b..55940e66e8a9 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py @@ -100,21 +100,21 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin) Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. 
scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py index 81886ebd6bb5..07918074a8ea 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_upscale.py @@ -75,14 +75,14 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. low_res_scheduler ([`SchedulerMixin`]): - A scheduler used to add initial noise to the low res conditioning image. It must be an instance of + A scheduler used to add initial noise to the low resolution conditioning image. It must be an instance of [`DDPMScheduler`]. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of @@ -521,9 +521,6 @@ def __call__( num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - num_inference_steps (`int`, *optional*, defaults to 50): - The number of denoising steps. More denoising steps usually lead to a higher quality image at the - expense of slower inference. guidance_scale (`float`, *optional*, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. 
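For readers connecting the `low_res_scheduler`/`scheduler` split and the `guidance_scale` argument documented above, a minimal usage sketch of the upscaling pipeline may help. The checkpoint name, image URL, prompt, and parameter values below are illustrative assumptions, not values taken from this documentation.

```py
# Minimal sketch: upscale a low resolution image with StableDiffusionUpscalePipeline.
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipeline = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

# The low resolution conditioning image is noised internally by `low_res_scheduler`,
# while `scheduler` drives the denoising of the upscaled latents.
low_res_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"  # assumed example image
).resize((128, 128))

upscaled_image = pipeline(
    prompt="a white cat",
    image=low_res_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
upscaled_image.save("upscaled_cat.png")
```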
diff --git a/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py b/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py index de28a0aebfa3..32ebe5be57d0 100644 --- a/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py +++ b/src/diffusers/pipelines/stable_diffusion_safe/pipeline_stable_diffusion_safe.py @@ -29,21 +29,21 @@ class StableDiffusionPipelineSafe(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of - `DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. + [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ _optional_components = ["safety_checker", "feature_extractor"] @@ -493,7 +493,7 @@ def __call__( The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The height in pixels of the generated image. - width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): The width in pixels of the generated image. num_inference_steps (`int`, *optional*, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the diff --git a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py index 48ec538730b4..5273120c9ab0 100644 --- a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py +++ b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py @@ -28,7 +28,7 @@ class KarrasVePipeline(DiffusionPipeline): Parameters: unet ([`UNet2DModel`]): - A [`UNet2DModel`] to denoise the encoded image. + A `UNet2DModel` to denoise the encoded image. scheduler ([`KarrasVeScheduler`]): A scheduler to be used in combination with `unet` to denoise the encoded image. 
""" @@ -70,10 +70,6 @@ def __call__( Example: - ```py - - ``` - Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` diff --git a/src/diffusers/pipelines/unclip/pipeline_unclip.py b/src/diffusers/pipelines/unclip/pipeline_unclip.py index 77fa2d1f8acd..a2de12e3f372 100644 --- a/src/diffusers/pipelines/unclip/pipeline_unclip.py +++ b/src/diffusers/pipelines/unclip/pipeline_unclip.py @@ -39,10 +39,10 @@ class UnCLIPPipeline(DiffusionPipeline): implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: - text_encoder ([`CLIPTextModelWithProjection`]): + text_encoder ([`~transformers.CLIPTextModelWithProjection`]): Frozen text-encoder. - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. prior ([`PriorTransformer`]): The canonical unCLIP prior to approximate the image embedding from the text embedding. text_proj ([`UnCLIPTextProjModel`]): diff --git a/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py b/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py index 820a94c51623..eb008a74e08c 100644 --- a/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py +++ b/src/diffusers/pipelines/unclip/pipeline_unclip_image_variation.py @@ -37,19 +37,19 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline): """ - Pipeline for image-guided image generation using UnCLIP. + Pipeline to generate image variations from an input image using UnCLIP. This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.). Args: - text_encoder ([`CLIPTextModelWithProjection`]): + text_encoder ([`~transformers.CLIPTextModelWithProjection`]): Frozen text-encoder. - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. - feature_extractor ([`CLIPImageProcessor`]): + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. + feature_extractor ([`~transformers.CLIPImageProcessor`]): Model that extracts features from generated images to be used as inputs for the `image_encoder`. - image_encoder ([`CLIPVisionModelWithProjection`]): + image_encoder ([`~transformers.CLIPVisionModelWithProjection`]): Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). text_proj ([`UnCLIPTextProjModel`]): Utility class to prepare and combine the embeddings before they are passed to the decoder. diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py index 5a730c3ed890..131caa2e0cf7 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py @@ -27,21 +27,21 @@ class VersatileDiffusionPipeline(DiffusionPipeline): Args: vae ([`AutoencoderKL`]): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). 
- tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. - safety_checker ([`StableDiffusionMegaSafetyChecker`]): + safety_checker ([`StableDiffusionSafetyChecker`]): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details about a model's potential harms. - feature_extractor ([`CLIPImageProcessor`]): - A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. """ tokenizer: CLIPTokenizer diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py index 495462a3cfb8..12da760e14f3 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py @@ -50,10 +50,10 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline): Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): Text-encoder model based on [`~transformers.BERT`]. - tokenizer ([`transformers.BertTokenizer`]): - A [`transformers.BertTokenizer`]. + tokenizer ([`~transformers.BertTokenizer`]): + A `BertTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py index 7b4d77026ab1..80f028798631 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py @@ -44,10 +44,10 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline): Vector-quantized (VQ) model to encode and decode images to and from latent representations. bert ([`LDMBertModel`]): Text-encoder model based on [`~transformers.BERT`]. - tokenizer ([`transformers.BertTokenizer`]): - A [`transformers.BertTokenizer`]. + tokenizer ([`~transformers.BertTokenizer`]): + A `BertTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. 
scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py index 5cdde87fee68..738a9ee0e228 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py @@ -44,9 +44,9 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline): bert ([`LDMBertModel`]): Text-encoder model based on [`~transformers.BERT`]. tokenizer ([`~transformers.BertTokenizer`]): - A [`~transformers.BertTokenizer`]. + A `BertTokenizer` to tokenize text. unet ([`UNet2DConditionModel`]): - A [`UNet2DConditionModel`] to denoise the encoded image latents. + A `UNet2DConditionModel` to denoise the encoded image latents. scheduler ([`SchedulerMixin`]): A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. diff --git a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py index fc9cd57a085f..5f8bfcb4ebda 100644 --- a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py +++ b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py @@ -60,12 +60,12 @@ class VQDiffusionPipeline(DiffusionPipeline): vqvae ([`VQModel`]): Vector Quantized Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. - text_encoder ([`CLIPTextModel`]): + text_encoder ([`~transformers.CLIPTextModel`]): Frozen text-encoder ([clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)). - tokenizer (`CLIPTokenizer`): - A [`~transformers.CLIPTokenizer`] to tokenize text. + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. transformer ([`Transformer2DModel`]): - A [`Transformer2DModel`] to denoise the encoded image latents. + A conditional `Transformer2DModel` to denoise the encoded image latents. scheduler ([`VQDiffusionScheduler`]): A scheduler to be used in combination with `transformer` to denoise the encoded image latents. """ @@ -303,8 +303,9 @@ def __call__( def truncate(self, log_p_x_0: torch.FloatTensor, truncation_rate: float) -> torch.FloatTensor: """ - Truncates log_p_x_0 such that for each column vector, the total cumulative probability is `truncation_rate` The - lowest probabilities that would increase the cumulative probability above `truncation_rate` are set to zero. + Truncates `log_p_x_0` such that for each column vector, the total cumulative probability is `truncation_rate` + The lowest probabilities that would increase the cumulative probability above `truncation_rate` are set to + zero. 
""" sorted_log_p_x_0, indices = torch.sort(log_p_x_0, 1, descending=True) sorted_p_x_0 = torch.exp(sorted_log_p_x_0) From 5675ed10f4befd7a2740f22961552a90eacb9ef8 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 13 Jul 2023 17:29:24 -0700 Subject: [PATCH 11/13] align doc titles --- docs/source/en/_toctree.yml | 10 +++++----- docs/source/en/api/pipelines/model_editing.mdx | 2 +- docs/source/en/api/pipelines/text_to_video.mdx | 2 +- docs/source/en/api/pipelines/text_to_video_zero.mdx | 2 +- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index a55ebfc11a4a..3dc17d6202d8 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -181,7 +181,7 @@ - local: api/pipelines/alt_diffusion title: AltDiffusion - local: api/pipelines/attend_and_excite - title: Attend and Excite + title: Attend-and-Excite - local: api/pipelines/audio_diffusion title: Audio Diffusion - local: api/pipelines/audioldm @@ -211,7 +211,7 @@ - local: api/pipelines/latent_diffusion title: Latent Diffusion - local: api/pipelines/panorama - title: MultiDiffusion Panorama + title: MultiDiffusion - local: api/pipelines/paint_by_example title: PaintByExample - local: api/pipelines/paradigms @@ -265,11 +265,11 @@ - local: api/pipelines/stochastic_karras_ve title: Stochastic Karras VE - local: api/pipelines/model_editing - title: Text-to-Image Model Editing + title: Text-to-image model editing - local: api/pipelines/text_to_video - title: Text-to-Video + title: Text-to-video - local: api/pipelines/text_to_video_zero - title: Text-to-Video Zero + title: Text2Video-Zero - local: api/pipelines/unclip title: UnCLIP - local: api/pipelines/latent_diffusion_uncond diff --git a/docs/source/en/api/pipelines/model_editing.mdx b/docs/source/en/api/pipelines/model_editing.mdx index 823f4f20237b..4aa8a1d83fe4 100644 --- a/docs/source/en/api/pipelines/model_editing.mdx +++ b/docs/source/en/api/pipelines/model_editing.mdx @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Text-to-Image Model Editing +# Text-to-image model editing [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://huggingface.co/papers/2303.08084) is by Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. This pipeline enables editing diffusion model weights, such that its assumptions of a given concept are changed. The resulting change is expected to take effect in all prompt generations related to the edited concept. diff --git a/docs/source/en/api/pipelines/text_to_video.mdx b/docs/source/en/api/pipelines/text_to_video.mdx index 319a9cca16ff..6d28fb0e29d0 100644 --- a/docs/source/en/api/pipelines/text_to_video.mdx +++ b/docs/source/en/api/pipelines/text_to_video.mdx @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. -# Text-to-Video +# Text-to-video [VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation](https://huggingface.co/papers/2303.08320) is by Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan. 
diff --git a/docs/source/en/api/pipelines/text_to_video_zero.mdx b/docs/source/en/api/pipelines/text_to_video_zero.mdx index f5b4ace56c9f..b64d72db0187 100644 --- a/docs/source/en/api/pipelines/text_to_video_zero.mdx +++ b/docs/source/en/api/pipelines/text_to_video_zero.mdx @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Text-to-Video Zero +# Text2Video-Zero [Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, From 1160b7dbc926bffb76c682b9efd8e6c12a1c3153 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 18 Jul 2023 11:37:55 -0700 Subject: [PATCH 12/13] more review fixes --- .../pipeline_consistency_models.py | 4 ++-- .../controlnet/pipeline_controlnet_sd_xl.py | 17 +++++++---------- .../dance_diffusion/pipeline_dance_diffusion.py | 4 ++-- .../pipeline_latent_diffusion.py | 2 +- ...pipeline_latent_diffusion_superresolution.py | 2 +- src/diffusers/pipelines/pndm/pipeline_pndm.py | 4 ++-- .../pipelines/repaint/pipeline_repaint.py | 4 ++-- .../score_sde_ve/pipeline_score_sde_ve.py | 4 ++-- .../pipeline_stable_diffusion_depth2img.py | 4 +--- .../pipeline_stable_diffusion_latent_upscale.py | 4 +--- .../pipeline_stable_diffusion_xl.py | 12 +++++------- .../pipeline_stable_diffusion_xl_inpaint.py | 17 +++++++---------- .../pipeline_stochastic_karras_ve.py | 4 ++-- .../pipeline_stable_diffusion_adapter.py | 8 +++----- .../pipeline_versatile_diffusion.py | 6 +++--- .../pipeline_versatile_diffusion_dual_guided.py | 9 ++++----- ...eline_versatile_diffusion_image_variation.py | 4 +--- ...ipeline_versatile_diffusion_text_to_image.py | 4 +--- .../vq_diffusion/pipeline_vq_diffusion.py | 6 +++--- 19 files changed, 50 insertions(+), 69 deletions(-) diff --git a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py index 7fe1d12a251f..83cb37dc1e35 100644 --- a/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py +++ b/src/diffusers/pipelines/consistency_models/pipeline_consistency_models.py @@ -228,8 +228,8 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ # 0. Prepare call parameters img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py index 08729767b82f..b6a93672c733 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py @@ -136,17 +136,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. 
When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -154,17 +152,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py index f7d2ea4fba62..b2d46c6f90f1 100644 --- a/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py +++ b/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py @@ -94,8 +94,8 @@ def __call__( Returns: [`~pipelines.AudioPipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.AudioPipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated audio. + If `return_dict` is `True`, [`~pipelines.AudioPipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated audio. """ if audio_length_in_s is None: diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py index 958d7750884a..e86f7b985e47 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py @@ -103,7 +103,7 @@ def __call__( output_type (`str`, *optional*, defaults to `"pil"`): The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. 
Example: diff --git a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py index f6806af3c37e..c8d5c1a1891d 100644 --- a/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py +++ b/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py @@ -94,7 +94,7 @@ def __call__( output_type (`str`, *optional*, defaults to `"pil"`): The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~ImagePipelineOutput`] instead of a plain tuple. + Whether or not to return a [`ImagePipelineOutput`] instead of a plain tuple. Example: diff --git a/src/diffusers/pipelines/pndm/pipeline_pndm.py b/src/diffusers/pipelines/pndm/pipeline_pndm.py index 747747b6f001..4add91fd1a69 100644 --- a/src/diffusers/pipelines/pndm/pipeline_pndm.py +++ b/src/diffusers/pipelines/pndm/pipeline_pndm.py @@ -91,8 +91,8 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ # For more information on the sampling method you can take a look at Algorithm 2 of # the official paper: https://arxiv.org/pdf/2202.09778.pdf diff --git a/src/diffusers/pipelines/repaint/pipeline_repaint.py b/src/diffusers/pipelines/repaint/pipeline_repaint.py index 038f9280c782..8200b9db630d 100644 --- a/src/diffusers/pipelines/repaint/pipeline_repaint.py +++ b/src/diffusers/pipelines/repaint/pipeline_repaint.py @@ -180,8 +180,8 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ original_image = image diff --git a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py index 69aec5b60a44..ace4f0c60db8 100644 --- a/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py +++ b/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py @@ -68,8 +68,8 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. 
""" img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py index 701526a3c154..e7fe39b9c865 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py @@ -598,9 +598,7 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, - otherwise a `tuple` is returned where the first element is a list with the generated images and the - second element is a list of `bool`s indicating whether the corresponding generated image contains - "not-safe-for-work" (nsfw) content. + otherwise a `tuple` is returned where the first element is a list with the generated images. """ # 1. Check inputs self.check_inputs( diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py index da1fa84b41d9..cad82cb71940 100644 --- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_latent_upscale.py @@ -357,9 +357,7 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, - otherwise a `tuple` is returned where the first element is a list with the generated images and the - second element is a list of `bool`s indicating whether the corresponding generated image contains - "not-safe-for-work" (nsfw) content. + otherwise a `tuple` is returned where the first element is a list with the generated images. """ # 1. Check inputs diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py index 9b5c18c78a4c..6f0ecf4df8d6 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py @@ -649,8 +649,8 @@ def __call__( The output format of the generate image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead + of a plain tuple. callback (`Callable`, *optional*): A function that will be called every `callback_steps` steps during inference. The function will be called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. @@ -676,11 +676,9 @@ def __call__( Examples: Returns: - [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] or `tuple`: - [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a - `tuple. 
When returning a tuple, the first element is a list with the generated images, and the second - element is a list of `bool`s denoting whether the corresponding generated image likely represents - "not-safe-for-work" (nsfw) content, according to the `safety_checker`. + [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] or `tuple`: + [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a + `tuple`. When returning a tuple, the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.default_sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py index 6a5187c2e7cf..bc55d1322839 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py @@ -258,17 +258,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() @@ -276,17 +274,16 @@ def disable_vae_slicing(self): # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling def enable_vae_tiling(self): r""" - Enable tiled VAE decoding. - - When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in - several steps. This is useful to save a large amount of memory and to allow the processing of larger images. + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. """ self.vae.enable_tiling() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling def disable_vae_tiling(self): r""" - Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step. 
""" self.vae.disable_tiling() diff --git a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py index 5273120c9ab0..61b5ed2d160f 100644 --- a/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py +++ b/src/diffusers/pipelines/stochastic_karras_ve/pipeline_stochastic_karras_ve.py @@ -72,8 +72,8 @@ def __call__( Returns: [`~pipelines.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.utils.ImagePipelineOutput`] is returned, otherwise a `tuple` - is returned where the first element is a list with the generated images. + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ img_size = self.unet.config.sample_size diff --git a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py index 97c6fb99157f..49ab2304c146 100644 --- a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py +++ b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py @@ -203,17 +203,15 @@ def __init__( # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing def enable_vae_slicing(self): r""" - Enable sliced VAE decoding. - - When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several - steps. This is useful to save some memory and allow larger batch sizes. + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.vae.enable_slicing() # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing def disable_vae_slicing(self): r""" - Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.vae.disable_slicing() diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py index 131caa2e0cf7..68c720ab2ad0 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion.py @@ -391,9 +391,9 @@ def dual_guided( ``` Returns: - [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise - a `tuple` is returned where the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. 
""" expected_components = inspect.signature(VersatileDiffusionDualGuidedPipeline.__init__).parameters.keys() diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py index 12da760e14f3..ed61b37171f1 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_dual_guided.py @@ -430,8 +430,7 @@ def __call__( output_type (`str`, *optional*, defaults to `"pil"`): The output format of the generated image. Choose between `PIL.Image` or `np.array`. return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a - plain tuple. + Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. callback (`Callable`, *optional*): A function that calls every `callback_steps` steps during inference. The function is called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. @@ -471,9 +470,9 @@ def __call__( ``` Returns: - [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise - a `tuple` is returned where the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py index 80f028798631..1bc14f7dc492 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_image_variation.py @@ -319,9 +319,7 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, - otherwise a `tuple` is returned where the first element is a list with the generated images and the - second element is a list of `bool`s indicating whether the corresponding generated image contains - "not-safe-for-work" (nsfw) content. + otherwise a `tuple` is returned where the first element is a list with the generated images. """ # 0. 
Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py index 738a9ee0e228..3d88f4ee4416 100644 --- a/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py +++ b/src/diffusers/pipelines/versatile_diffusion/pipeline_versatile_diffusion_text_to_image.py @@ -393,9 +393,7 @@ def __call__( Returns: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, - otherwise a `tuple` is returned where the first element is a list with the generated images and the - second element is a list of `bool`s indicating whether the corresponding generated image contains - "not-safe-for-work" (nsfw) content. + otherwise a `tuple` is returned where the first element is a list with the generated images. """ # 0. Default height and width to unet height = height or self.image_unet.config.sample_size * self.vae_scale_factor diff --git a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py index 5f8bfcb4ebda..1abe50a9b6b6 100644 --- a/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py +++ b/src/diffusers/pipelines/vq_diffusion/pipeline_vq_diffusion.py @@ -212,9 +212,9 @@ def __call__( every step. Returns: - [`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: - If `return_dict` is `True`, [`~pipelines.stable_diffusion.ImagePipelineOutput`] is returned, otherwise - a `tuple` is returned where the first element is a list with the generated images. + [`~pipelines.ImagePipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is + returned where the first element is a list with the generated images. """ if isinstance(prompt, str): batch_size = 1 From f0a672888a442598ca1151de216e1c5d69c437aa Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 21 Jul 2023 10:43:10 -0700 Subject: [PATCH 13/13] final review --- docs/source/en/api/pipelines/alt_diffusion.mdx | 4 ++++ docs/source/en/api/pipelines/audioldm.mdx | 2 +- docs/source/en/api/pipelines/consistency_models.mdx | 6 ------ docs/source/en/api/pipelines/ddim.mdx | 8 +------- docs/source/en/api/pipelines/diffedit.mdx | 4 +++- 5 files changed, 9 insertions(+), 15 deletions(-) diff --git a/docs/source/en/api/pipelines/alt_diffusion.mdx b/docs/source/en/api/pipelines/alt_diffusion.mdx index e3d248f31db4..ed8db52f9a51 100644 --- a/docs/source/en/api/pipelines/alt_diffusion.mdx +++ b/docs/source/en/api/pipelines/alt_diffusion.mdx @@ -18,6 +18,10 @@ The abstract from the paper is: *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. 
From f0a672888a442598ca1151de216e1c5d69c437aa Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Fri, 21 Jul 2023 10:43:10 -0700
Subject: [PATCH 13/13] final review

---
 docs/source/en/api/pipelines/alt_diffusion.mdx      | 4 ++++
 docs/source/en/api/pipelines/audioldm.mdx           | 2 +-
 docs/source/en/api/pipelines/consistency_models.mdx | 6 ------
 docs/source/en/api/pipelines/ddim.mdx               | 8 +-------
 docs/source/en/api/pipelines/diffedit.mdx           | 4 +++-
 5 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/docs/source/en/api/pipelines/alt_diffusion.mdx b/docs/source/en/api/pipelines/alt_diffusion.mdx
index e3d248f31db4..ed8db52f9a51 100644
--- a/docs/source/en/api/pipelines/alt_diffusion.mdx
+++ b/docs/source/en/api/pipelines/alt_diffusion.mdx
@@ -18,6 +18,10 @@ The abstract from the paper is:

 *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*

+## Tips
+
+`AltDiffusion` is conceptually the same as [Stable Diffusion](./stable_diffusion/overview).
+
 <Tip>

 Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
diff --git a/docs/source/en/api/pipelines/audioldm.mdx b/docs/source/en/api/pipelines/audioldm.mdx
index 2407a205c92b..e810c9e27a28 100644
--- a/docs/source/en/api/pipelines/audioldm.mdx
+++ b/docs/source/en/api/pipelines/audioldm.mdx
@@ -21,7 +21,7 @@ The abstract from the paper is:

 *Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.*

-The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM), and the pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
+The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM).

 ## Tips
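To make the AudioLDM page edited above easier to follow, here is a minimal text-to-audio sketch. The `AudioLDMPipeline` class is the one the page documents; the `cvssp/audioldm-s-full-v2` checkpoint, prompt, and output path are assumptions for illustration only.

```py
import torch
from scipy.io.wavfile import write
from diffusers import AudioLDMPipeline

# Assumed checkpoint name; other AudioLDM variants follow the same API.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16).to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
# The pipeline returns one waveform (a numpy array) per prompt under `.audios`.
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# AudioLDM generates audio at a 16 kHz sampling rate.
write("techno.wav", rate=16000, data=audio)
```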
diff --git a/docs/source/en/api/pipelines/consistency_models.mdx b/docs/source/en/api/pipelines/consistency_models.mdx
index cbe2691e67d5..26f73e88b409 100644
--- a/docs/source/en/api/pipelines/consistency_models.mdx
+++ b/docs/source/en/api/pipelines/consistency_models.mdx
@@ -34,12 +34,6 @@ For an additional speed-up, use `torch.compile` to generate multiple images in <
     image.show()
 ```

-<Tip>
-
-Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
 ## ConsistencyModelPipeline
 [[autodoc]] ConsistencyModelPipeline
    - all
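The consistency models hunk above trims the tip block but keeps the surrounding `torch.compile` speed-up example. As a rough sketch of that speed-up (assuming the `ConsistencyModelPipeline` class and the `openai/diffusers-cd_imagenet64_l2` checkpoint, both illustrative choices), compiling the UNet once lets repeated one-step calls reuse the compiled graph:

```py
import torch
from diffusers import ConsistencyModelPipeline

# Assumed checkpoint; other consistency model checkpoints should work the same way.
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
).to("cuda")

# Compile the UNet once; later calls reuse the compiled graph.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Single-step sampling: the first call pays the compilation cost, subsequent calls are fast.
for i in range(4):
    image = pipe(num_inference_steps=1).images[0]
    image.save(f"consistency_{i}.png")
```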
diff --git a/docs/source/en/api/pipelines/ddim.mdx b/docs/source/en/api/pipelines/ddim.mdx
index 04e2f0a33bce..c2bf95c4e566 100644
--- a/docs/source/en/api/pipelines/ddim.mdx
+++ b/docs/source/en/api/pipelines/ddim.mdx
@@ -18,13 +18,7 @@ The abstract from the paper is:

 *Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*

-The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author at [tsong.me](https://tsong.me/).
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
+The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim).

 ## DDIMPipeline
 [[autodoc]] DDIMPipeline
diff --git a/docs/source/en/api/pipelines/diffedit.mdx b/docs/source/en/api/pipelines/diffedit.mdx
index 986b9ec6a9c1..bb2ade6125ad 100644
--- a/docs/source/en/api/pipelines/diffedit.mdx
+++ b/docs/source/en/api/pipelines/diffedit.mdx
@@ -18,7 +18,9 @@ The abstract from the paper is:

 *Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*

-The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion/](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
+The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
+
+This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️

 ## Tips
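As a rough illustration of the DiffEdit workflow summarized in the abstract above (automatic mask generation by contrasting prompts, latent inversion, then masked denoising), here is a minimal sketch using `StableDiffusionDiffEditPipeline`; the checkpoint, prompts, and input image are illustrative assumptions rather than part of this patch.

```py
import torch
from diffusers import StableDiffusionDiffEditPipeline, DDIMScheduler, DDIMInverseScheduler
from diffusers.utils import load_image

# Assumed checkpoint; other Stable Diffusion checkpoints compatible with the pipeline should also work.
pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

# Placeholder input image path; substitute your own picture of a bowl of fruit.
raw_image = load_image("path/to/fruit_bowl.png").resize((768, 768))
source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"

# 1. Contrast noise estimates for the two prompts to derive an edit mask automatically.
mask_image = pipe.generate_mask(image=raw_image, source_prompt=source_prompt, target_prompt=target_prompt)
# 2. Invert the input image into latents conditioned on the source prompt.
inv_latents = pipe.invert(prompt=source_prompt, image=raw_image).latents
# 3. Denoise with the target prompt, editing only the masked region.
image = pipe(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
image.save("pears.png")
```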