
Issue "ValueError: Attempting to unscale FP16 gradients." when running SDXL_DreamBooth_LoRA_.ipynbΒ #6124

@yleoed

Description

Describe the bug

I am trying to run the well-known Colab notebook SDXL_DreamBooth_LoRA_.ipynb to build a DreamBooth model out of SDXL + VAE using accelerate launch train_dreambooth_lora_sdxl.py.

Yet I end up with this error, which is frustrating.

Reproduction

Running only this notebook, without changes: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb

The only thing I added was the peft installation (!pip install --upgrade peft), after hitting an error that the package was missing.

I am on Google Colab Pro+ with a V100: 50 GB RAM, 16 GB VRAM.

The version of train_dreambooth_lora_sdxl.py comes from the command !wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py.
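So, concretely, the only setup cells relevant to this report (everything else is the stock notebook) are:

!pip install --upgrade peft
!wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py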

#!/usr/bin/env bash
!accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="dog" \
  --output_dir="corgy_dog_LoRA" \
  --caption_column="prompt" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of TOK dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=3 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --max_train_steps=500 \
  --checkpointing_steps=717 \
  --seed="0"

Logs

Here is the full output, ending with "ValueError: Attempting to unscale FP16 gradients.", from running the command accelerate launch train_dreambooth_lora_sdxl.py...:

2023-12-10 17:31:51.597271: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-10 17:31:51.597330: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-10 17:31:51.597385: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-10 17:31:52.726226: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
12/10/2023 17:31:53 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

tokenizer/tokenizer_config.json: 100% 737/737 [00:00<00:00, 3.82MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 1.36MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 902kB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.49MB/s]
tokenizer_2/tokenizer_config.json: 100% 725/725 [00:00<00:00, 4.13MB/s]
tokenizer_2/special_tokens_map.json: 100% 460/460 [00:00<00:00, 2.42MB/s]
text_encoder/config.json: 100% 565/565 [00:00<00:00, 3.52MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
text_encoder_2/config.json: 100% 575/575 [00:00<00:00, 3.55MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 479/479 [00:00<00:00, 2.47MB/s]
{'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:07<00:00, 67.5MB/s]
model.safetensors: 100% 2.78G/2.78G [00:46<00:00, 59.8MB/s]
config.json: 100% 631/631 [00:00<00:00, 3.59MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:18<00:00, 18.0MB/s]
unet/config.json: 100% 1.68k/1.68k [00:00<00:00, 7.97MB/s]
diffusion_pytorch_model.safetensors: 100% 10.3G/10.3G [02:29<00:00, 68.5MB/s]
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Downloading data files: 100% 6/6 [00:00<00:00, 35444.82it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
Generating train split: 5 examples [00:00, 91.51 examples/s]
12/10/2023 17:36:04 - INFO - __main__ - ***** Running training *****
12/10/2023 17:36:04 - INFO - __main__ -   Num examples = 5
12/10/2023 17:36:04 - INFO - __main__ -   Num batches each epoch = 5
12/10/2023 17:36:04 - INFO - __main__ -   Num Epochs = 250
12/10/2023 17:36:04 - INFO - __main__ -   Instantaneous batch size per device = 1
12/10/2023 17:36:04 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
12/10/2023 17:36:04 - INFO - __main__ -   Gradient Accumulation steps = 3
12/10/2023 17:36:04 - INFO - __main__ -   Total optimization steps = 500
Steps:   0% 0/500 [00:03<?, ?it/s, loss=0.162, lr=0.0001] Traceback (most recent call last):
  File "/content/train_dreambooth_lora_sdxl.py", line 1716, in <module>
    main(args)
  File "/content/train_dreambooth_lora_sdxl.py", line 1494, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0% 0/500 [00:05<?, ?it/s, loss=0.162, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--dataset_name=dog', '--output_dir=corgy_dog_LoRA', '--caption_column=prompt', '--mixed_precision=fp16', '--instance_prompt=a photo of TOK dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=3', '--gradient_checkpointing', '--learning_rate=1e-4', '--snr_gamma=5.0', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--mixed_precision=fp16', '--use_8bit_adam', '--max_train_steps=500', '--checkpointing_steps=717', '--seed=0']' returned non-zero exit status 1.

System Info

!diffusers-cli env

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • Huggingface_hub version: 0.19.4
  • Transformers version: 4.35.2
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    38W / 300W |   1784MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Who can help?

@sayakpaul @patrickvonplaten I am wondering if you can help me with this error while running DreamBooth LoRA on SDXL via accelerate launch train_dreambooth_lora_sdxl.py.

FYI, I tried excluding the VAE and had the same issue.

N.B. I am trying to find a working setup for my students.
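From searching around, the usual explanation seems to be that the trainable LoRA parameters end up in fp16 along with the frozen base weights, so torch's GradScaler refuses to unscale their gradients. Below is a minimal sketch of the workaround I have seen suggested (the helper name is mine, and I have not verified it against this exact script version): cast only the parameters that require grad back to fp32 before the models are handed to accelerator.prepare().

import torch

def upcast_trainable_params(models, dtype=torch.float32):
    # Hypothetical helper: only the injected LoRA weights require grad;
    # the frozen SDXL weights stay in fp16.
    for model in models:
        for param in model.parameters():
            if param.requires_grad:
                param.data = param.data.to(dtype)

# e.g. upcast_trainable_params([unet, text_encoder_one, text_encoder_two])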
