From 1835f428120df1f02c06ebf6bcc989d613b59955 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 1 Aug 2023 14:09:32 -0700 Subject: [PATCH 01/11] first draft --- docs/source/en/_toctree.yml | 2 + .../stable_diffusion/stable_diffusion_xl.md | 423 +----------------- docs/source/en/using-diffusers/sdxl.md | 350 +++++++++++++++ 3 files changed, 359 insertions(+), 416 deletions(-) create mode 100644 docs/source/en/using-diffusers/sdxl.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index ae45906bc3c6..ee1ed1bbb34a 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -38,6 +38,8 @@ - sections: - local: using-diffusers/pipeline_overview title: Overview + - local: using-diffusers/sdxl + title: Stable Diffusion XL - local: using-diffusers/unconditional_image_generation title: Unconditional image generation - local: using-diffusers/conditional_image_generation diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index f6585f819928..696cc34aee4c 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -10,414 +10,27 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Stable diffusion XL +# Stable Diffusion XL -Stable Diffusion XL was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach +Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach. -The abstract of the paper is the following: +The abstract from the paper is: *We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* ## Tips -- Stable Diffusion XL works especially well with images between 768 and 1024. -- Stable Diffusion XL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. -- Stable Diffusion XL output image can be improved by making use of a refiner as shown below. -- One can make use of `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to influence the generation process. +- SDXL works especially well with images between 768 and 1024. 
+- SDXL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. +- SDXL output image can be improved by making use of a refiner as shown below. -### Available checkpoints: - -- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with [`StableDiffusionXLPipeline`] -- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) with [`StableDiffusionXLImg2ImgPipeline`] - -## Usage Example - -Before using SDXL make sure to have `transformers`, `accelerate`, `safetensors` and `invisible_watermark` installed. -You can install the libraries as follows: - -``` -pip install transformers -pip install accelerate -pip install safetensors -``` - -### Watermarker - -We recommend to add an invisible watermark to images generating by Stable Diffusion XL, this can help with identifying if an image is machine-synthesised for downstream applications. To do so, please install -the [invisible-watermark library](https://pypi.org/project/invisible-watermark/) via: - -``` -pip install invisible-watermark>=0.2.0 -``` - -If the `invisible-watermark` library is installed the watermarker will be used **by default**. - -If you have other provisions for generating or deploying images safely, you can disable the watermarker as follows: - -```py -pipe = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) -``` - -### Text-to-Image - -You can use SDXL as follows for *text-to-image*: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe(prompt=prompt).images[0] -``` - -You can additionally pass negative conditions about an image's size and position to avoid undesirable cropping behavior in the generated image, and improve image resolution. Let's take an example: - -```python -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_crops_coords_top_left=(0, 0), - negative_target_size=(1024, 1024), -).images[0] -``` - -Here is a comparative example that shows the influence of using three `negative_original_size`s of -(128, 128), (256, 256), and (512, 512) respectively: - -![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png) - - - -One can use these negative conditions in the other SDXL pipelines ([Image-To-Image](#image-to-image), [Inpainting](#inpainting), [ControlNet](../controlnet_sdxl.md)) too! 
- - - -### Image-to-image - -You can use SDXL as follows for *image-to-image*: - -```py -import torch -from diffusers import StableDiffusionXLImg2ImgPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe = pipe.to("cuda") -url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png" - -init_image = load_image(url).convert("RGB") -prompt = "a photo of an astronaut riding a horse on mars" -image = pipe(prompt, image=init_image).images[0] -``` - -### Inpainting - -You can use SDXL as follows for *inpainting* - -```py -import torch -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A majestic tiger sitting on a bench" -image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0] -``` - -### Refining the image output - -In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), -StableDiffusion-XL also includes a [refiner checkpoint](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) -that is specialized in denoising low-noise stage images to generate images of improved high-frequency quality. -This refiner checkpoint can be used as a "second-step" pipeline after having run the base checkpoint to improve -image quality. - -When using the refiner, one can easily -- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers* as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/) or -- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model. - -**Note**: The idea of using SD-XL base & refiner as an ensemble of experts was first brought forward by -a couple community contributors which also helped shape the following `diffusers` implementation, namely: -- [SytanSD](https://github.com/SytanSD) -- [bghira](https://github.com/bghira) -- [Birch-san](https://github.com/Birch-san) -- [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter) - -#### 1.) Ensemble of Expert Denoisers - -When using the base and refiner model as an ensemble of expert of denoisers, the base model should serve as the -expert for the high-noise diffusion stage and the refiner serves as the expert for the low-noise diffusion stage. - -The advantage of 1.) over 2.) is that it requires less overall denoising steps and therefore should be significantly -faster. The drawback is that one cannot really inspect the output of the base model; it will still be heavily denoised. 
- -To use the base model and refiner as an ensemble of expert denoisers, make sure to define the span -of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise -denoising stage (*i.e.* the refiner model) respectively. We can set the intervals using the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) of the base model -and [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) of the refiner model. - -For both `denoising_end` and `denoising_start` a float value between 0 and 1 should be passed. -When passed, the end and start of denoising will be defined by proportions of discrete timesteps as -defined by the model schedule. -Note that this will override `strength` if it is also declared, since the number of denoising steps -is determined by the discrete timesteps the model was trained on and the declared fractional cutoff. - -Let's look at an example. -First, we import the two pipelines. Since the text encoders and variational autoencoder are the same -you don't have to load those again for the refiner. - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -base.to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") -``` - -Now we define the number of inference steps and the point at which the model shall be run through the -high-noise denoising stage (*i.e.* the base model). - -```py -n_steps = 40 -high_noise_frac = 0.8 -``` - -Stable Diffusion XL base is trained on timesteps 0-999 and Stable Diffusion XL refiner is finetuned -from the base model on low noise timesteps 0-199 inclusive, so we use the base model for the first -800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Hence, `high_noise_frac` -is set to 0.8, so that all steps 200-999 (the first 80% of denoising timesteps) are performed by the -base model and steps 0-199 (the last 20% of denoising timesteps) are performed by the refiner model. - -Remember, the denoising process starts at **high value** (high noise) timesteps and ends at -**low value** (low noise) timesteps. - -Let's run the two pipelines now. Make sure to set `denoising_end` and -`denoising_start` to the same values and keep `num_inference_steps` constant. 
Also remember that -the output of the base model should be in latent space: - -```py -prompt = "A majestic lion jumping from a big stone at night" - -image = base( - prompt=prompt, - num_inference_steps=n_steps, - denoising_end=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - num_inference_steps=n_steps, - denoising_start=high_noise_frac, - image=image, -).images[0] -``` - -Let's have a look at the images - -| Original Image | Ensemble of Denoisers Experts | -|---|---| -| ![lion_base_timesteps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png) | ![lion_refined_timesteps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png) - -If we would have just run the base model on the same 40 steps, the image would have been arguably less detailed (e.g. the lion eyes and nose): -The ensemble-of-experts method works well on all available schedulers! +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! To learn how to use [`StableDiffusionXLPipeline`] for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](/using-diffusers/sdxl) guide. -#### 2.) Refining the image output from fully denoised base image - -In standard [`StableDiffusionImg2ImgPipeline`]-fashion, the fully-denoised image generated of the base model -can be further improved using the [refiner checkpoint](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0). - -For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image -pipeline. You can leave the outputs of the base model in latent space. - -```py -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" - -image = pipe(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] -image = refiner(prompt=prompt, image=image[None, :]).images[0] -``` - -| Original Image | Refined Image | -|---|---| -| ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/init_image.png) | ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/refined_image.png) | - - - -The refiner can also very well be used in an in-painting setting. 
To do so just make - sure you use the [`StableDiffusionXLInpaintPipeline`] classes as shown below - - - -To use the refiner for inpainting in the Ensemble of Expert Denoisers setting you can do the following: - -```py -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A majestic tiger sitting on a bench" -num_inference_steps = 75 -high_noise_frac = 0.7 - -image = pipe( - prompt=prompt, - image=init_image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - image=image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, -).images[0] -``` - -To use the refiner for inpainting in the standard SDE-style setting, simply remove `denoising_end` and `denoising_start` and choose a smaller -number of inference steps for the refiner. - -### Loading single file checkpoints / original file format - -By making use of [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] you can also load the -original file format into `diffusers`: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_single_file( - "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" -) -refiner.to("cuda") -``` - -### Memory optimization via model offloading - -If you are seeing out-of-memory errors, we recommend making use of [`StableDiffusionXLPipeline.enable_model_cpu_offload`]. - -```diff -- pipe.to("cuda") -+ pipe.enable_model_cpu_offload() -``` - -and - -```diff -- refiner.to("cuda") -+ refiner.enable_model_cpu_offload() -``` - -### Speed-up inference with `torch.compile` - -You can speed up inference by making use of `torch.compile`. This should give you **ca.** 20% speed-up. 
- -```diff -+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) -``` - -### Running with `torch < 2.0` - -**Note** that if you want to run Stable Diffusion XL with `torch` < 2.0, please make sure to enable xformers -attention: - -``` -pip install xformers -``` - -```diff -+pipe.enable_xformers_memory_efficient_attention() -+refiner.enable_xformers_memory_efficient_attention() -``` - ## StableDiffusionXLPipeline [[autodoc]] StableDiffusionXLPipeline @@ -435,25 +48,3 @@ pip install xformers [[autodoc]] StableDiffusionXLInpaintPipeline - all - __call__ - -### Passing different prompts to each text-encoder - -Stable Diffusion XL was trained on two text encoders. The default behavior is to pass the same prompt to each. But it is possible to pass a different prompt for each text-encoder, as [some users](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201) noted that it can boost quality. -To do so, you can pass `prompt_2` and `negative_prompt_2` in addition to `prompt` and `negative_prompt`. By doing that, you will pass the original prompts and negative prompts (as in `prompt` and `negative_prompt`) to `text_encoder` (in official SDXL 0.9/1.0 that is [OpenAI CLIP-ViT/L-14](https://huggingface.co/openai/clip-vit-large-patch14)), -and `prompt_2` and `negative_prompt_2` to `text_encoder_2` (in official SDXL 0.9/1.0 that is [OpenCLIP-ViT/bigG-14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)). - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -# prompt will be passed to OAI CLIP-ViT/L-14 -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -# prompt_2 will be passed to OpenCLIP-ViT/bigG-14 -prompt_2 = "monet painting" -image = pipe(prompt=prompt, prompt_2=prompt_2).images[0] -``` diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md new file mode 100644 index 000000000000..c087b20a8802 --- /dev/null +++ b/docs/source/en/using-diffusers/sdxl.md @@ -0,0 +1,350 @@ +# Stable Diffusion XL + +[[open-in-colab]] + +[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: + +1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters +2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped +3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details + +This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries in Colab +#!pip install transformers accelerate safetensors invisible-watermark>=0.2.0 +``` + + + +We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. 
If the invisible-watermark library is installed, it is used by default. To disable the watermarker: + +```py +pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) +``` + + + +## Load single file formats + +Use the [`~StableDiffusionXLPipeline.from_single_file`] method to load single file formats (`.ckpt` or `.safetensors`) into 🤗 Diffusers (otherwise you can use [`~StableDiffusionXLPipeline.from_pretrained`]): + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_single_file( + "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( + "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" +).to("cuda") +``` + +## Text-to-image + +For text-to-image, pass a text prompt: + +```py +from diffusers import AutoPipeline +import torch + +pipeline_text2image = AutoPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt).images[0] +``` + +
+[Image: generated image of an astronaut in a jungle]
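+
+The call above relies on the pipeline defaults. The usual diffusion settings also apply here; the sketch below reuses the `pipeline_text2image` object loaded above, and the parameter values are only illustrative rather than tuned recommendations:
+
+```py
+# a minimal sketch, reusing the `pipeline_text2image` object loaded above
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipeline_text2image(
+    prompt=prompt,
+    negative_prompt="blurry, low quality",  # steer generation away from unwanted traits
+    height=1024,                            # SDXL works best around 1024x1024
+    width=1024,
+    num_inference_steps=30,
+    guidance_scale=7.5,
+).images[0]
+image.save("sdxl_text2img.png")
+```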
+ +## Image-to-image + +For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: + +```py +from diffusers import AutoPipeline +from diffusers.utils import load_image + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipeline.from_pipe(pipeline_text2image).to("cuda") +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png" + +init_image = load_image(url).convert("RGB") +prompt = "a dog catching a frisbee in the jungle" +image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] +``` + +
+[Image: generated image of a dog catching a frisbee in a jungle]
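+
+How closely the result follows the initial image is mostly controlled by `strength`; roughly `int(num_inference_steps * strength)` denoising steps are actually run. A quick sketch, assuming the `pipeline`, `init_image`, and `prompt` from the snippet above, for comparing a few values:
+
+```py
+# assumes `pipeline`, `init_image`, and `prompt` from the image-to-image example above
+# lower strength keeps more of the initial image; higher strength follows the prompt more
+for strength in [0.3, 0.5, 0.8]:
+    image = pipeline(prompt, image=init_image, strength=strength, guidance_scale=10.5).images[0]
+    image.save(f"sdxl_img2img_strength_{strength}.png")
+```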
+ +## Inpainting + +For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. + +```py +from diffusers import AutoPipeline +from diffusers.utils import load_image + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipeline.from_pipe(pipeline_text2image).to("cuda") + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A deep sea diver floating" +image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +``` + +
+[Image: generated image of a deep sea diver in a jungle]
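+
+The mask is a black-and-white image: white pixels are repainted and black pixels are kept. If you don't have a ready-made mask, you can sketch one with Pillow. The rectangle below is an arbitrary example region, and the snippet assumes the inpainting `pipeline` and `init_image` from above:
+
+```py
+from PIL import Image, ImageDraw
+
+# assumes the inpainting `pipeline` and `init_image` from the example above
+mask = Image.new("L", init_image.size, 0)       # start fully black (keep everything)
+draw = ImageDraw.Draw(mask)
+draw.rectangle((300, 150, 700, 600), fill=255)  # hypothetical region to repaint
+
+image = pipeline(prompt="a waterfall", image=init_image, mask_image=mask).images[0]
+```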
+ +## Refine image quality + +SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: + +1. use the base and refiner model as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/) (❤️ thanks to the following contributors for proposing and implementing this method: [SytanSD](https://github.com/SytanSD), [bghira](https://github.com/bghira), [Birch-san](https://github.com/Birch-san), [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter)) +2. use the refiner with [SDEdit](https://huggingface.co/papers/2108.01073) after running the base model (this is how SDXL is originally trained) + +### Ensemble of expert denoisers + +The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it is heavily denoised. + +As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + +To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter. + + + +The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. + + + +Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. + +```py +prompt = "A majestic lion jumping from a big stone at night" + +image = base( + prompt=prompt, + num_inference_steps=40, + denoising_end=0.8, + output_type="latent", +).images +image = refiner( + prompt=prompt, + num_inference_steps=40, + denoising_start=0.8, + image=image, +).images[0] +``` + +
+[Figure: generated image of a lion on a rock at night; left: base model, right: ensemble of expert denoisers (higher quality)]
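+
+The 80/20 split above is only a starting point. A small helper, not part of the Diffusers API and assuming the `base` and `refiner` pipelines loaded above, makes it easy to compare different handoff points:
+
+```py
+# `split` is the fraction of denoising handled by the base model before the
+# latents are handed to the refiner; assumes `base` and `refiner` from above
+def generate(prompt, split=0.8, steps=40):
+    latents = base(
+        prompt=prompt,
+        num_inference_steps=steps,
+        denoising_end=split,
+        output_type="latent",
+    ).images
+    return refiner(
+        prompt=prompt,
+        num_inference_steps=steps,
+        denoising_start=split,
+        image=latents,
+    ).images[0]
+
+image = generate("A majestic lion jumping from a big stone at night", split=0.7)
+```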
+ +For inpainting, use the [`StableDiffusionXLInpaintPipeline`]: + +```py +from diffusers import StableDiffusionXLInpaintPipeline +from diffusers.utils import load_image + +base = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=pipe.text_encoder_2, + vae=pipe.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A majestic tiger sitting on a bench" +num_inference_steps = 75 +high_noise_frac = 0.7 + +image = base( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_end=high_noise_frac, + output_type="latent", +).images +image = refiner( + prompt=prompt, + image=image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_start=high_noise_frac, +).images[0] +``` + +This ensemble of expert denoisers method works well for all available schedulers! + +### Refine fully-denoised base image + +SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, similar to image-to-image generation. + +Load the base and refiner models: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=pipe.text_encoder_2, + vae=pipe.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + +Generate an image from the base model, and set the model output to **latent** space: + +```py +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +image = pipe(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] +``` + +Pass the generated image to the refiner model: + +```py +image = refiner(prompt=prompt, image=image[None, :]).images[0] +``` + +
+[Figure: generated image of an astronaut riding a green horse on Mars; left: base model, right: base model + refiner model (higher quality)]
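+
+If you always hand the output to the refiner, you can drop the `use_refiner` toggle from the snippet above (a flag that is assumed to be defined beforehand) and keep the whole latent batch instead of re-adding a batch dimension. A minimal sketch, assuming the `base` and `refiner` pipelines from above:
+
+```py
+# assumes the `base` and `refiner` pipelines loaded above
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+latents = base(prompt=prompt, output_type="latent").images  # keep the batch dimension
+image = refiner(prompt=prompt, image=latents).images[0]
+```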
+ +For inpainting, use the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. + +## Use a different prompt for each text-encoder + +SDXL uses two text-encoders so it is possible to pass a different prompt to each text-encoder which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompt): + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +# prompt is passed to OAI CLIP-ViT/L-14 +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +# prompt_2 is passed to OpenCLIP-ViT/bigG-14 +prompt_2 = "Van Gogh painting" +image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +``` + +
+[Image: generated image of an astronaut in a jungle in the style of a van gogh painting]
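+
+The same split works for negative prompts: `negative_prompt` goes to the first text encoder and `negative_prompt_2` to the second. A short sketch reusing `pipeline`, `prompt`, and `prompt_2` from above; the negative prompts themselves are only examples:
+
+```py
+# assumes `pipeline`, `prompt`, and `prompt_2` from the example above
+image = pipeline(
+    prompt=prompt,
+    prompt_2=prompt_2,
+    negative_prompt="blurry, low resolution",  # passed to the first text encoder
+    negative_prompt_2="photorealistic",        # passed to the second text encoder
+).images[0]
+```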
+ +## Cropped image generation + +Images generated from previous Stable Diffusion models may sometimes appear to be randomly cropped due to how the model is trained. By conditioning SDXL on the cropping parameters, SDXL is able to generate images that are more centered and subjects in the images aren't randomly cut off. You can control the amount of cropping during inference with the [`crops_coords_top_left`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.crops_coords_top_left) parameter. By default, `crops_coords_top_left` is (0, 0) for a centered image. + +```py +from diffusers import StableDiffusionXLPipeline +import torch + + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] +``` + +
+[Image: generated image of an astronaut in a jungle, slightly cropped]
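+
+To get a feel for the effect, it can help to render the same prompt with a few different offsets (a quick sketch assuming the `pipeline` and `prompt` from the example above):
+
+```py
+# assumes `pipeline` and `prompt` from the example above
+for top_left in [(0, 0), (0, 256), (256, 0)]:
+    image = pipeline(prompt=prompt, crops_coords_top_left=top_left).images[0]
+    image.save(f"sdxl_crop_{top_left[0]}_{top_left[1]}.png")
+```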
+ +## Optimizations + +SDXL is a large model, and you may need to optimize your memory to get it to run on hardware. Here are some tips to save memory and speed up inference. + +1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: + +```diff +- base.to("cuda") +- refiner.to("cuda") ++ base.enable_model_cpu_offload ++ refiner.enable_model_cpu_offload +``` + +2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): + +```diff ++ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) ++ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) +``` + +3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: + +```diff ++ base.enable_xformers_memory_efficient_attention() ++ refiner.enable_xformers_memory_efficient_attention() +``` \ No newline at end of file From d4badb8e2f4f61e7ee4f56b677323c4da37ac566 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 10 Aug 2023 11:47:57 -0700 Subject: [PATCH 02/11] reorg toctree --- docs/source/en/_toctree.yml | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index ee1ed1bbb34a..8f5da8332e85 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -3,6 +3,8 @@ title: 🧨 Diffusers - local: quicktour title: Quicktour + - local: using-diffusers/stable_diffusion_jax_how_to + title: Stable Diffusion in JAX/Flax - local: stable_diffusion title: Effective and efficient diffusion - local: installation @@ -36,40 +38,42 @@ title: Push files to the Hub title: Loading & Hub - sections: - - local: using-diffusers/pipeline_overview - title: Overview - - local: using-diffusers/sdxl - title: Stable Diffusion XL - local: using-diffusers/unconditional_image_generation title: Unconditional image generation - local: using-diffusers/conditional_image_generation - title: Text-to-image generation + title: Text-to-image - local: using-diffusers/img2img - title: Text-guided image-to-image + title: Image-to-image - local: using-diffusers/inpaint - title: Text-guided image-inpainting + title: Inpainting - local: using-diffusers/depth2img - title: Text-guided depth-to-image + title: Depth-to-image + title: Tasks + - sections: - local: using-diffusers/textual_inversion_inference title: Textual inversion - local: training/distributed_inference title: Distributed inference with multiple GPUs - - local: using-diffusers/distilled_sd - title: Distilled Stable Diffusion inference - local: using-diffusers/reusing_seeds title: Improve image quality with deterministic generation - local: using-diffusers/control_brightness title: Control image brightness + - local: using-diffusers/weighted_prompts + title: Prompt weighting + title: Techniques + - sections: + - local: using-diffusers/pipeline_overview + title: Overview + - local: using-diffusers/sdxl + title: Stable Diffusion XL + - local: using-diffusers/distilled_sd + title: Distilled Stable Diffusion inference - local: using-diffusers/reproducibility title: Create reproducible pipelines - local: using-diffusers/custom_pipeline_examples title: Community pipelines - local: using-diffusers/contribute_pipeline title: How to contribute a community pipeline - - local: using-diffusers/stable_diffusion_jax_how_to - title: Stable Diffusion in JAX/Flax - - local: using-diffusers/weighted_prompts - title: Prompt weighting title: Pipelines for Inference - sections: - local: training/overview 
From b5146c7fd79d96fae4f7f68913382277ec54b9a7 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 10 Aug 2023 13:00:45 -0700 Subject: [PATCH 03/11] note about minsdxl --- docs/source/en/using-diffusers/sdxl.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index c087b20a8802..60d1ad551d94 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -347,4 +347,8 @@ SDXL is a large model, and you may need to optimize your memory to get it to run ```diff + base.enable_xformers_memory_efficient_attention() + refiner.enable_xformers_memory_efficient_attention() -``` \ No newline at end of file +``` + +## Other resources + +If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers. \ No newline at end of file From 73dcc5495e6ca3a5c5885a4a8de3944d41a08675 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 24 Aug 2023 13:55:58 -0700 Subject: [PATCH 04/11] feedback --- docs/source/en/_toctree.yml | 4 +- .../stable_diffusion/stable_diffusion_xl.md | 8 ++- .../en/using-diffusers/pipeline_overview.md | 4 +- docs/source/en/using-diffusers/sdxl.md | 59 ++++++++++++------- 4 files changed, 46 insertions(+), 29 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 8f5da8332e85..57b80ca54427 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -3,8 +3,6 @@ title: 🧨 Diffusers - local: quicktour title: Quicktour - - local: using-diffusers/stable_diffusion_jax_how_to - title: Stable Diffusion in JAX/Flax - local: stable_diffusion title: Effective and efficient diffusion - local: installation @@ -111,6 +109,8 @@ title: Memory and Speed - local: optimization/torch2.0 title: Torch2.0 support + - local: using-diffusers/stable_diffusion_jax_how_to + title: Stable Diffusion in JAX/Flax - local: optimization/xformers title: xFormers - local: optimization/onnx diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index 696cc34aee4c..94b1e5fe107c 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # Stable Diffusion XL -Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach. +Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. The abstract from the paper is: @@ -22,12 +22,14 @@ The abstract from the paper is: - SDXL works especially well with images between 768 and 1024. - SDXL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. 
-- SDXL output image can be improved by making use of a refiner as shown below. +- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! To learn how to use [`StableDiffusionXLPipeline`] for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](/using-diffusers/sdxl) guide. +To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](/using-diffusers/sdxl) guide. + +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! diff --git a/docs/source/en/using-diffusers/pipeline_overview.md b/docs/source/en/using-diffusers/pipeline_overview.md index ca98fc3f4b63..4ee25b51dc6f 100644 --- a/docs/source/en/using-diffusers/pipeline_overview.md +++ b/docs/source/en/using-diffusers/pipeline_overview.md @@ -12,6 +12,6 @@ specific language governing permissions and limitations under the License. # Overview -A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. +A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. -This section introduces you to some of the tasks supported by our pipelines such as unconditional image generation and different techniques and variations of text-to-image generation. You'll also learn how to gain more control over the generation process by setting a seed for reproducibility and weighting prompts to adjust the influence certain words in the prompt has over the output. Finally, you'll see how you can create a community pipeline for a custom task like generating images from speech. \ No newline at end of file +This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech. 
\ No newline at end of file diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 60d1ad551d94..1fbce5922e2a 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -14,7 +14,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install transformers accelerate safetensors invisible-watermark>=0.2.0 +#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 ``` @@ -27,20 +27,35 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) -## Load single file formats +## Load model checkpoints -Use the [`~StableDiffusionXLPipeline.from_single_file`] method to load single file formats (`.ckpt` or `.safetensors`) into 🤗 Diffusers (otherwise you can use [`~StableDiffusionXLPipeline.from_pretrained`]): +Use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: ```py from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline import torch pipeline = StableDiffusionXLPipeline.from_single_file( - "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True + "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" + "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" +).to("cuda") +``` + +Model weights may also be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( + "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" ).to("cuda") ``` @@ -49,10 +64,10 @@ refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( For text-to-image, pass a text prompt: ```py -from diffusers import AutoPipeline +from diffusers import AutoPipelineForText2Image import torch -pipeline_text2image = AutoPipeline.from_pretrained( +pipeline_text2image = AutoPipelineForText2Image.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") @@ -69,11 +84,11 @@ image = pipeline(prompt=prompt).images[0] For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. 
Pass an initial image, and a text prompt to condition the image with: ```py -from diffusers import AutoPipeline +from diffusers import AutoPipelineForImg2Img from diffusers.utils import load_image # use from_pipe to avoid consuming additional memory when loading a checkpoint -pipeline = AutoPipeline.from_pipe(pipeline_text2image).to("cuda") +pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png" init_image = load_image(url).convert("RGB") @@ -90,11 +105,11 @@ image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).im For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. ```py -from diffusers import AutoPipeline +from diffusers import AutoPipelineForInpainting from diffusers.utils import load_image # use from_pipe to avoid consuming additional memory when loading a checkpoint -pipeline = AutoPipeline.from_pipe(pipeline_text2image).to("cuda") +pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" @@ -103,7 +118,7 @@ init_image = load_image(img_url).convert("RGB") mask_image = load_image(mask_url).convert("RGB") prompt = "A deep sea diver floating" -image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] ```
@@ -114,12 +129,12 @@ image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, strength=0. SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: -1. use the base and refiner model as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/) (❤️ thanks to the following contributors for proposing and implementing this method: [SytanSD](https://github.com/SytanSD), [bghira](https://github.com/bghira), [Birch-san](https://github.com/Birch-san), [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter)) -2. use the refiner with [SDEdit](https://huggingface.co/papers/2108.01073) after running the base model (this is how SDXL is originally trained) +1. use the base and refiner model together to produce a refined image +2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) -### Ensemble of expert denoisers +### Base + refiner model -The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it is heavily denoised. +When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: @@ -179,7 +194,7 @@ image = refiner(
-For inpainting, use the [`StableDiffusionXLInpaintPipeline`]: +The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]: ```py from diffusers import StableDiffusionXLInpaintPipeline @@ -227,9 +242,9 @@ image = refiner( This ensemble of expert denoisers method works well for all available schedulers! -### Refine fully-denoised base image +### Base model to refiner model -SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, similar to image-to-image generation. +SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. Load the base and refiner models: @@ -256,7 +271,7 @@ Generate an image from the base model, and set the model output to **latent** sp ```py prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] +image = base(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] ``` Pass the generated image to the refiner model: @@ -276,7 +291,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0] -For inpainting, use the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. +For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. ## Use a different prompt for each text-encoder @@ -324,7 +339,7 @@ image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] ## Optimizations -SDXL is a large model, and you may need to optimize your memory to get it to run on hardware. Here are some tips to save memory and speed up inference. +SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. 1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: From c3f63aeddcee3a4548d969e80af8370b48c03fa7 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Thu, 24 Aug 2023 14:26:58 -0700 Subject: [PATCH 05/11] fix --- docs/source/en/using-diffusers/sdxl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 1fbce5922e2a..8cb755e1ba24 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -242,7 +242,7 @@ image = refiner( This ensemble of expert denoisers method works well for all available schedulers! -### Base model to refiner model +### Base to refiner model SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. 
From c0656c370b5b29cec753960a8d7a0cbed9abed8c Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 29 Aug 2023 11:15:07 -0400 Subject: [PATCH 06/11] micro-conditionings --- docs/source/en/using-diffusers/sdxl.md | 76 ++++++++++++++++++++++---- 1 file changed, 66 insertions(+), 10 deletions(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 8cb755e1ba24..42cf4fca2b2b 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -293,30 +293,44 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0] For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. -## Use a different prompt for each text-encoder +## Micro-conditioning -SDXL uses two text-encoders so it is possible to pass a different prompt to each text-encoder which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompt): +SDXL adds two addition conditioning techniques, which are referred to as *micro-conditioning*. The model is conditioned on image size so at inference, you can set a desired resolution for the generated image. The second conditioning is based on cropping parameters, allowing you to control how the generated image is cropped. + + + +Size and crop-conditioning parameters can be used together to generate high-resolution images that are centered on a subject. These micro-conditionings and negative micro-conditionings are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImageToImagePipeline`], [`StableDiffusionXLInpaintingPipeline`], and [`StableDiffusionXLControlNetPipeline`]. + + + +## Size conditioning + +Size conditioning takes advantage of what SDXL has learned about image features at different resolutions during training to generate higher quality images during inference. You can experiment with this by adjusting the [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) and [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) parameters. By default, both parameters are set to 1024 to generate better images. If your `original_size` and `target_size` don't match, then the image is either down or upsampled to match the `target_size`. 
+ +🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: ```py from diffusers import StableDiffusionXLPipeline import torch -pipeline = StableDiffusionXLPipeline.from_pretrained( +pipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") -# prompt is passed to OAI CLIP-ViT/L-14 prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -# prompt_2 is passed to OpenCLIP-ViT/bigG-14 -prompt_2 = "Van Gogh painting" -image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_target_size=(1024, 1024), +).images[0] ``` -
-    [figure: generated image of an astronaut in a jungle in the style of a van gogh painting]
+
+    [figure: Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).]
 
-## Cropped image generation
+## Crop conditioning
 
 Images generated from previous Stable Diffusion models may sometimes appear to be randomly cropped due to how the model is trained. By conditioning SDXL on the cropping parameters, SDXL is able to generate images that are more centered and subjects in the images aren't randomly cut off. You can control the amount of cropping during inference with the [`crops_coords_top_left`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.crops_coords_top_left) parameter. By default, `crops_coords_top_left` is (0, 0) for a centered image.
 
@@ -337,6 +351,48 @@ image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0]
     [figure: generated image of an astronaut in a jungle, slightly cropped]
 
+You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_crops_coords_top_left=(0, 0), + negative_target_size=(1024, 1024), +).images[0] +``` + +## Use a different prompt for each text-encoder + +SDXL uses two text-encoders so it is possible to pass a different prompt to each text-encoder which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompt): + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +# prompt is passed to OAI CLIP-ViT/L-14 +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +# prompt_2 is passed to OpenCLIP-ViT/bigG-14 +prompt_2 = "Van Gogh painting" +image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +``` + +
+    [figure: generated image of an astronaut in a jungle in the style of a van gogh painting]
+
+ ## Optimizations SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. From 5c949bd19748edfc24250ec4674982be2c612693 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 29 Aug 2023 11:35:48 -0400 Subject: [PATCH 07/11] add tip --- .../en/api/pipelines/stable_diffusion/stable_diffusion_xl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index 94b1e5fe107c..d472825b8f18 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -23,7 +23,7 @@ The abstract from the paper is: - SDXL works especially well with images between 768 and 1024. - SDXL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. - SDXL output images can be improved by making use of a refiner model in an image-to-image setting. - +- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. From f0becc4528a34917cb9702990d7d16b1dc30c267 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 29 Aug 2023 12:01:02 -0400 Subject: [PATCH 08/11] fix section levels --- docs/source/en/using-diffusers/sdxl.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 42cf4fca2b2b..e6cb9532d290 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -303,7 +303,7 @@ Size and crop-conditioning parameters can be used together to generate high-reso -## Size conditioning +### Size conditioning Size conditioning takes advantage of what SDXL has learned about image features at different resolutions during training to generate higher quality images during inference. You can experiment with this by adjusting the [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) and [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) parameters. By default, both parameters are set to 1024 to generate better images. If your `original_size` and `target_size` don't match, then the image is either down or upsampled to match the `target_size`. @@ -330,7 +330,7 @@ image = pipe(
     [figure: Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).]
 
-## Crop conditioning +### Crop conditioning Images generated from previous Stable Diffusion models may sometimes appear to be randomly cropped due to how the model is trained. By conditioning SDXL on the cropping parameters, SDXL is able to generate images that are more centered and subjects in the images aren't randomly cut off. You can control the amount of cropping during inference with the [`crops_coords_top_left`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.crops_coords_top_left) parameter. By default, `crops_coords_top_left` is (0, 0) for a centered image. From 61c608b5c54e74f2a4c56759ba982d3d67d3fbfd Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Tue, 29 Aug 2023 13:28:34 -0400 Subject: [PATCH 09/11] d'oh fix pipeline names --- docs/source/en/using-diffusers/sdxl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index e6cb9532d290..2c48a364c1cd 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -299,7 +299,7 @@ SDXL adds two addition conditioning techniques, which are referred to as *micro- -Size and crop-conditioning parameters can be used together to generate high-resolution images that are centered on a subject. These micro-conditionings and negative micro-conditionings are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImageToImagePipeline`], [`StableDiffusionXLInpaintingPipeline`], and [`StableDiffusionXLControlNetPipeline`]. +Size and crop-conditioning parameters can be used together to generate high-resolution images that are centered on a subject. These micro-conditionings and negative micro-conditionings are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. From 1d783259f2b8e9642fed053c7f2f12750f1b55e3 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Wed, 30 Aug 2023 10:37:29 -0400 Subject: [PATCH 10/11] feedback --- .../stable_diffusion/stable_diffusion_xl.md | 2 +- docs/source/en/using-diffusers/sdxl.md | 32 +++++++++++-------- 2 files changed, 20 insertions(+), 14 deletions(-) diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index d472825b8f18..e9fc9ae09380 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -21,7 +21,7 @@ The abstract from the paper is: ## Tips - SDXL works especially well with images between 768 and 1024. -- SDXL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. +- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. - SDXL output images can be improved by making use of a refiner model in an image-to-image setting. - SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. 
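The tip added above only names the negative conditioning parameters. As a rough sketch of how they might be passed through the image-to-image pipeline, which the guide lists as supporting them, assuming the base checkpoint and a placeholder input image:

```py
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image
import torch

pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

init_image = load_image("path/to/image.png")  # placeholder starting image

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(
    prompt=prompt,
    image=init_image,
    # negatively condition on these resolution and cropping values,
    # steering the result away from them
    negative_original_size=(512, 512),
    negative_crops_coords_top_left=(0, 0),
    negative_target_size=(1024, 1024),
).images[0]
```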
diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 2c48a364c1cd..7e86cebf8fcc 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -29,33 +29,33 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) ## Load model checkpoints -Use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: ```py from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline import torch -pipeline = StableDiffusionXLPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" + "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" ).to("cuda") ``` -Model weights may also be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: +You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: ```py from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline import torch -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" + "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" ).to("cuda") ``` @@ -271,7 +271,7 @@ Generate an image from the base model, and set the model output to **latent** sp ```py prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = base(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] +image = base(prompt=prompt, output_type="latent").images[0] ``` Pass the generated image to the refiner model: @@ -295,11 +295,11 @@ For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline ## Micro-conditioning -SDXL adds two addition conditioning techniques, which are referred to as *micro-conditioning*. The model is conditioned on image size so at inference, you can set a desired resolution for the generated image. 
The second conditioning is based on cropping parameters, allowing you to control how the generated image is cropped. +SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. -Size and crop-conditioning parameters can be used together to generate high-resolution images that are centered on a subject. These micro-conditionings and negative micro-conditionings are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. +You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. @@ -307,6 +307,12 @@ Size and crop-conditioning parameters can be used together to generate high-reso Size conditioning takes advantage of what SDXL has learned about image features at different resolutions during training to generate higher quality images during inference. You can experiment with this by adjusting the [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) and [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) parameters. By default, both parameters are set to 1024 to generate better images. If your `original_size` and `target_size` don't match, then the image is either down or upsampled to match the `target_size`. +There are two types of size conditioning: + +- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. + +- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! 
+ 🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: ```py @@ -332,7 +338,7 @@ image = pipe( ### Crop conditioning -Images generated from previous Stable Diffusion models may sometimes appear to be randomly cropped due to how the model is trained. By conditioning SDXL on the cropping parameters, SDXL is able to generate images that are more centered and subjects in the images aren't randomly cut off. You can control the amount of cropping during inference with the [`crops_coords_top_left`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.crops_coords_top_left) parameter. By default, `crops_coords_top_left` is (0, 0) for a centered image. +Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! ```py from diffusers import StableDiffusionXLPipeline @@ -372,7 +378,7 @@ image = pipe( ## Use a different prompt for each text-encoder -SDXL uses two text-encoders so it is possible to pass a different prompt to each text-encoder which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompt): +SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): ```py from diffusers import StableDiffusionXLPipeline From 6cfec25bf80bb6b98e6462e93936dda8b2854824 Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Wed, 30 Aug 2023 11:00:08 -0400 Subject: [PATCH 11/11] remove old section --- docs/source/en/using-diffusers/sdxl.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 7e86cebf8fcc..4ca02a4cc2c5 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -305,8 +305,6 @@ You can use both micro-conditioning and negative micro-conditioning parameters t ### Size conditioning -Size conditioning takes advantage of what SDXL has learned about image features at different resolutions during training to generate higher quality images during inference. You can experiment with this by adjusting the [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) and [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) parameters. By default, both parameters are set to 1024 to generate better images. 
If your `original_size` and `target_size` don't match, then the image is either down or upsampled to match the `target_size`. - There are two types of size conditioning: - [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset.