Description
Describe the bug
I am trying to run the well-known Colab notebook SDXL_DreamBooth_LoRA_.ipynb to train a DreamBooth LoRA on SDXL plus the fp16-fix VAE via accelerate launch train_dreambooth_lora_sdxl.py.
However, the run keeps failing with the error below, which is frustrating.
Reproduction
I am running this notebook essentially unchanged: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb
The only addition is installing peft ("!pip install --upgrade peft") after hitting an error that the package was missing.
I am on Google Colab Pro+ with a V100 (16GB VRAM) and 50GB of RAM.
The version of train_dreambooth_lora_sdxl.py comes from the command !wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py.
#!/usr/bin/env bash
!accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="dog" \
  --output_dir="corgy_dog_LoRA" \
  --caption_column="prompt" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of TOK dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=3 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --max_train_steps=500 \
  --checkpointing_steps=717 \
  --seed="0"
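For reference, with the 5 training images in the "dog" dataset, the flags above imply the schedule that the script later logs (Num Epochs = 250, total train batch size = 3). A quick sanity check of that arithmetic (plain Python, not part of the script):

import math

# Values taken from the command above and from the training log below.
num_examples = 5                  # images in the "dog" dataset
train_batch_size = 1
gradient_accumulation_steps = 3
max_train_steps = 500

batches_per_epoch = math.ceil(num_examples / train_batch_size)                       # 5
update_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)  # 2
num_epochs = math.ceil(max_train_steps / update_steps_per_epoch)                     # 250
total_train_batch_size = train_batch_size * 1 * gradient_accumulation_steps          # 3 (single process)
print(num_epochs, total_train_batch_size)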
Logs
The run fails with "ValueError: Attempting to unscale FP16 gradients." while executing accelerate launch train_dreambooth_lora_sdxl.py with the arguments above.
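For context, the check that raises this error lives in PyTorch's GradScaler: it refuses to unscale gradients that belong to fp16 parameters. Here is a minimal standalone sketch (my own code, not taken from the training script) that trips the same ValueError when a trainable parameter is stored in half precision:

import torch

# A trainable parameter stored in fp16 (assumes a CUDA device), analogous to
# LoRA weights that end up in half precision under --mixed_precision="fp16".
param = torch.nn.Parameter(torch.randn(4, 4, device="cuda", dtype=torch.float16))
optimizer = torch.optim.AdamW([param], lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

loss = (param ** 2).mean()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.

The full log of the failing run follows: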
2023-12-10 17:31:51.597271: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-10 17:31:51.597330: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-10 17:31:51.597385: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-10 17:31:52.726226: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
12/10/2023 17:31:53 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
tokenizer/tokenizer_config.json: 100% 737/737 [00:00<00:00, 3.82MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 1.36MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 902kB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.49MB/s]
tokenizer_2/tokenizer_config.json: 100% 725/725 [00:00<00:00, 4.13MB/s]
tokenizer_2/special_tokens_map.json: 100% 460/460 [00:00<00:00, 2.42MB/s]
text_encoder/config.json: 100% 565/565 [00:00<00:00, 3.52MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
text_encoder_2/config.json: 100% 575/575 [00:00<00:00, 3.55MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 479/479 [00:00<00:00, 2.47MB/s]
{'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:07<00:00, 67.5MB/s]
model.safetensors: 100% 2.78G/2.78G [00:46<00:00, 59.8MB/s]
config.json: 100% 631/631 [00:00<00:00, 3.59MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:18<00:00, 18.0MB/s]
unet/config.json: 100% 1.68k/1.68k [00:00<00:00, 7.97MB/s]
diffusion_pytorch_model.safetensors: 100% 10.3G/10.3G [02:29<00:00, 68.5MB/s]
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Downloading data files: 100% 6/6 [00:00<00:00, 35444.82it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
Generating train split: 5 examples [00:00, 91.51 examples/s]
12/10/2023 17:36:04 - INFO - __main__ - ***** Running training *****
12/10/2023 17:36:04 - INFO - __main__ - Num examples = 5
12/10/2023 17:36:04 - INFO - __main__ - Num batches each epoch = 5
12/10/2023 17:36:04 - INFO - __main__ - Num Epochs = 250
12/10/2023 17:36:04 - INFO - __main__ - Instantaneous batch size per device = 1
12/10/2023 17:36:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
12/10/2023 17:36:04 - INFO - __main__ - Gradient Accumulation steps = 3
12/10/2023 17:36:04 - INFO - __main__ - Total optimization steps = 500
Steps: 0% 0/500 [00:03<?, ?it/s, loss=0.162, lr=0.0001]
Traceback (most recent call last):
  File "/content/train_dreambooth_lora_sdxl.py", line 1716, in <module>
    main(args)
  File "/content/train_dreambooth_lora_sdxl.py", line 1494, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/500 [00:05<?, ?it/s, loss=0.162, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--dataset_name=dog', '--output_dir=corgy_dog_LoRA', '--caption_column=prompt', '--mixed_precision=fp16', '--instance_prompt=a photo of TOK dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=3', '--gradient_checkpointing', '--learning_rate=1e-4', '--snr_gamma=5.0', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--mixed_precision=fp16', '--use_8bit_adam', '--max_train_steps=500', '--checkpointing_steps=717', '--seed=0']' returned non-zero exit status 1.
System Info
!diffusers-cli env
- diffusers version: 0.25.0.dev0
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- Huggingface_hub version: 0.19.4
- Transformers version: 4.35.2
- Accelerate version: 0.25.0
- xFormers version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P0 38W / 300W | 1784MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Who can help?
@sayakpaul @patrickvonplaten I am wondering if you can help me with this error when running DreamBooth LoRA on SDXL via accelerate launch train_dreambooth_lora_sdxl.py.
FYI: I tried excluding the VAE and hit the same issue.
N.B. I am trying to find a working setup for my students.
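In case it helps with triage, here is a tiny diagnostic sketch (placeholder usage on my side, not part of train_dreambooth_lora_sdxl.py) for listing trainable parameters that are stored in fp16, i.e. exactly the ones GradScaler refuses to unscale:

import torch

# Print every trainable parameter kept in half precision for a given module,
# e.g. report_fp16_trainables(unet, "unet") right before the optimizer is built.
def report_fp16_trainables(module: torch.nn.Module, label: str) -> None:
    for name, param in module.named_parameters():
        if param.requires_grad and param.dtype == torch.float16:
            print(f"[{label}] trainable fp16 param: {name} {tuple(param.shape)}")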