
Issue "ValueError: Attempting to unscale FP16 gradients." when running SDXL_DreamBooth_LoRA_.ipynbΒ #6124

@yleoed

Description

Describe the bug

I am trying to run the well-known Colab notebook SDXL_DreamBooth_LoRA_.ipynb to build a DreamBooth model out of SDXL + VAE using accelerate launch train_dreambooth_lora_sdxl.py.

Yet I end up with this error, which is frustrating.

Reproduction

Running only this notebook, without changes: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb

The only thing I added was the peft installation (!pip install --upgrade peft), after hitting an error that the package was missing.

I am on Google Colab Pro+ with a V100: 50 GB RAM, 16 GB VRAM.

The version of train_dreambooth_lora_sdxl.py comes from the command !wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py.
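So, concretely, the only setup cells relevant to this report (everything else is the stock notebook) are:

!pip install --upgrade peft
!wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py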

#!/usr/bin/env bash
!accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="dog" \
  --output_dir="corgy_dog_LoRA" \
  --caption_column="prompt" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of TOK dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=3 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --max_train_steps=500 \
  --checkpointing_steps=717 \
  --seed="0"

Logs

Here is the full output, ending with "ValueError: Attempting to unscale FP16 gradients.", from running the command accelerate launch train_dreambooth_lora_sdxl.py...:

2023-12-10 17:31:51.597271: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-10 17:31:51.597330: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-10 17:31:51.597385: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-10 17:31:52.726226: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
12/10/2023 17:31:53 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

tokenizer/tokenizer_config.json: 100% 737/737 [00:00<00:00, 3.82MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 1.36MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 902kB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.49MB/s]
tokenizer_2/tokenizer_config.json: 100% 725/725 [00:00<00:00, 4.13MB/s]
tokenizer_2/special_tokens_map.json: 100% 460/460 [00:00<00:00, 2.42MB/s]
text_encoder/config.json: 100% 565/565 [00:00<00:00, 3.52MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
text_encoder_2/config.json: 100% 575/575 [00:00<00:00, 3.55MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 479/479 [00:00<00:00, 2.47MB/s]
{'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:07<00:00, 67.5MB/s]
model.safetensors: 100% 2.78G/2.78G [00:46<00:00, 59.8MB/s]
config.json: 100% 631/631 [00:00<00:00, 3.59MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:18<00:00, 18.0MB/s]
unet/config.json: 100% 1.68k/1.68k [00:00<00:00, 7.97MB/s]
diffusion_pytorch_model.safetensors: 100% 10.3G/10.3G [02:29<00:00, 68.5MB/s]
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Downloading data files: 100% 6/6 [00:00<00:00, 35444.82it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
Generating train split: 5 examples [00:00, 91.51 examples/s]
12/10/2023 17:36:04 - INFO - __main__ - ***** Running training *****
12/10/2023 17:36:04 - INFO - __main__ -   Num examples = 5
12/10/2023 17:36:04 - INFO - __main__ -   Num batches each epoch = 5
12/10/2023 17:36:04 - INFO - __main__ -   Num Epochs = 250
12/10/2023 17:36:04 - INFO - __main__ -   Instantaneous batch size per device = 1
12/10/2023 17:36:04 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
12/10/2023 17:36:04 - INFO - __main__ -   Gradient Accumulation steps = 3
12/10/2023 17:36:04 - INFO - __main__ -   Total optimization steps = 500
Steps:   0% 0/500 [00:03<?, ?it/s, loss=0.162, lr=0.0001] Traceback (most recent call last):
  File "/content/train_dreambooth_lora_sdxl.py", line 1716, in <module>
    main(args)
  File "/content/train_dreambooth_lora_sdxl.py", line 1494, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0% 0/500 [00:05<?, ?it/s, loss=0.162, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--dataset_name=dog', '--output_dir=corgy_dog_LoRA', '--caption_column=prompt', '--mixed_precision=fp16', '--instance_prompt=a photo of TOK dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=3', '--gradient_checkpointing', '--learning_rate=1e-4', '--snr_gamma=5.0', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--mixed_precision=fp16', '--use_8bit_adam', '--max_train_steps=500', '--checkpointing_steps=717', '--seed=0']' returned non-zero exit status 1.

System Info

!diffusers-cli env

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • Huggingface_hub version: 0.19.4
  • Transformers version: 4.35.2
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    38W / 300W |   1784MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Who can help?

@sayakpaul @patrickvonplaten I am wondering if you can help me with this error while running DreamBooth LoRA on SDXL via accelerate launch train_dreambooth_lora_sdxl.py.

FYI, I tried excluding the VAE and had the same issue.

N.B. I am trying to find a working setup for my students.
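From searching around, the usual explanation seems to be that the trainable LoRA parameters end up in fp16 along with the frozen base weights, so torch's GradScaler refuses to unscale their gradients. Below is a minimal sketch of the workaround I have seen suggested (the helper name is mine, and I have not verified it against this exact script version): cast only the parameters that require grad back to fp32 before the models are handed to accelerator.prepare().

import torch

def upcast_trainable_params(models, dtype=torch.float32):
    # Hypothetical helper: only the injected LoRA weights require grad;
    # the frozen SDXL weights stay in fp16.
    for model in models:
        for param in model.parameters():
            if param.requires_grad:
                param.data = param.data.to(dtype)

# e.g. upcast_trainable_params([unet, text_encoder_one, text_encoder_two])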
