Closed
Labels
bug: Something isn't working
Description
Describe the bug
After a system crash I attempted to resume from a prior checkpoint.
Expected Result:
Training continues from the last checkpoint and finishes successfully.
Actual Result:
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 2104, in <module>
    main(args)
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 1861, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
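A note on the error itself: `GradScaler` raises this `ValueError` when a gradient it is asked to unscale is stored in float16, which happens when the trainable parameter itself is float16. One possible cause, which is an assumption and not something confirmed in this report, is that the trainable LoRA weights come back in half precision when the checkpoint is restored. The sketch below is a way to check for that and work around it by upcasting only the trainable parameters. The helper names `find_fp16_trainable` and `upcast_trainable` and the `nn.Linear` stand-in are hypothetical and are not part of the training script.

```python
import torch
from torch import nn


def find_fp16_trainable(params):
    """Return the trainable parameters that are stored in float16."""
    return [p for p in params if p.requires_grad and p.dtype == torch.float16]


def upcast_trainable(params, dtype=torch.float32):
    """Cast trainable parameters to `dtype` in place; frozen ones are left untouched."""
    for p in params:
        if p.requires_grad:
            p.data = p.data.to(dtype)


# Hypothetical stand-in for LoRA layers that were restored in half precision.
layer = nn.Linear(4, 4).half()
params_to_clip = list(layer.parameters())

print(len(find_fp16_trainable(params_to_clip)))  # weight and bias are both fp16
upcast_trainable(params_to_clip)
print(len(find_fp16_trainable(params_to_clip)))  # none left after upcasting
```

In the training script, a check like this would presumably run right after the checkpoint state is loaded and before the first call to `accelerator.clip_grad_norm_`. Frozen base-model weights can stay in fp16, since the scaler only inspects parameters that have gradients.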
Reproduction
!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--report_to="wandb" \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name="./training_images" \
--output_dir="father_lora_v9" \
--cache_dir="./dataset_cache_dir" \
--caption_column="prompt" \
--mixed_precision="fp16" \
--instance_prompt="a photo of Brian de palma man" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--snr_gamma=5.0 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=1680 \
--checkpointing_steps=200 \
--validation_prompt="a photo of Brian de palma in a suit" \
--validation_epochs=10 \
--train_text_encoder \
--with_prior_preservation \
--class_data_dir="./prior_preservation" \
--num_class_images=150 \
--class_prompt="a photo of an old man" \
--rank=32 \
--optimizer="prodigy" \
--prodigy_safeguard_warmup=True \
--prodigy_use_bias_correction=True \
--adam_beta1=0.9 \
--adam_beta2=0.99 \
--adam_weight_decay=0.01 \
--train_text_encoder \
--learning_rate=1 \
--text_encoder_lr=1 \
--resume_from_checkpoint="checkpoint-1600" \
--seed="0"
Logs
Steps: 95%|████████████████████ | 1600/1680 [00:05<?, ?it/s, loss=0.0168, lr=1]
Traceback (most recent call last):
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 2104, in <module>
    main(args)
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 1861, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
System Info
- `diffusers` version: 0.26.0.dev0
- Platform: Linux-5.4.0-156-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Huggingface_hub version: 0.20.2
- Transformers version: 4.36.2
- Accelerate version: 0.25.0
- xFormers version: 0.0.23.post1
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
No response