
train_dreambooth_lora_sdxl_advanced.py --resume_from_checkpoint fails with ValueError: Attempting to unscale FP16 gradients. #6482

@steverhoades

Describe the bug

After a system crash, I attempted to resume training from a prior checkpoint via --resume_from_checkpoint.

Expected Result:
Training resumes from the last checkpoint and finishes successfully.

Actual Result:

Training crashes immediately on the first gradient-clipping step with:

ValueError: Attempting to unscale FP16 gradients.

(full traceback in the Logs section below)
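For context, here is a hedged sketch (a local stand-in, not the library code itself; the real guard lives in torch's `GradScaler._unscale_grads_`) of the dtype check that raises this error: any parameter whose `.grad` tensor is `torch.float16` is rejected before unscaling, which is what happens when the trainable LoRA parameters come back from the checkpoint in half precision.

```python
import torch

# Hedged sketch of the guard inside torch's GradScaler._unscale_grads_:
# any parameter whose .grad tensor is torch.float16 triggers this exact
# ValueError before the gradients are unscaled.
def unscale_check(params):
    for p in params:
        if p.grad is not None and p.grad.dtype == torch.float16:
            raise ValueError("Attempting to unscale FP16 gradients.")

# A trainable parameter restored from a checkpoint in half precision:
p = torch.nn.Parameter(torch.zeros(2, dtype=torch.float16))
p.grad = torch.zeros(2, dtype=torch.float16)

try:
    unscale_check([p])
    message = None
except ValueError as e:
    message = str(e)

print(message)  # Attempting to unscale FP16 gradients.
```

In a fresh run the trainable params start out in fp32 (only the frozen base model is cast to fp16), so the guard never fires; the checkpoint restore path is where fp16 trainable params sneak in.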

Reproduction

!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --report_to="wandb" \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="./training_images" \
  --output_dir="father_lora_v9" \
  --cache_dir="./dataset_cache_dir" \
  --caption_column="prompt" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of Brian de palma man" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4   \
  --gradient_checkpointing \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1680 \
  --checkpointing_steps=200 \
  --validation_prompt="a photo of Brian de palma in a suit" \
  --validation_epochs=10 \
  --train_text_encoder \
  --with_prior_preservation \
  --class_data_dir="./prior_preservation" \
  --num_class_images=150 \
  --class_prompt="a photo of an old man" \
  --rank=32 \
  --optimizer="prodigy" \
  --prodigy_safeguard_warmup=True \
  --prodigy_use_bias_correction=True \
  --adam_beta1=0.9 \
  --adam_beta2=0.99 \
  --adam_weight_decay=0.01 \
  --learning_rate=1 \
  --text_encoder_lr=1 \
  --resume_from_checkpoint="checkpoint-1600" \
  --seed="0"
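A workaround that usually resolves this class of error is to upcast only the trainable (LoRA) parameters back to fp32 after loading the checkpoint, so the GradScaler never sees fp16 gradients while the frozen base model stays in fp16. This is a hedged sketch under that assumption; `cast_training_params` below is a local stand-in for the equivalent helper diffusers ships in `diffusers.training_utils`.

```python
import torch

# Hedged sketch: keep frozen base weights in fp16 for memory, but upcast the
# trainable (LoRA) parameters to fp32 after resuming so their gradients are
# fp32 too. `cast_training_params` is a local stand-in for the equivalent
# helper in diffusers.training_utils.
def cast_training_params(model, dtype=torch.float32):
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(dtype)

base = torch.nn.Linear(8, 8).half()   # stand-in for the fp16 base model
for p in base.parameters():
    p.requires_grad_(False)           # frozen base stays fp16

lora = torch.nn.Linear(8, 8).half()   # stand-in for LoRA layers restored in fp16

cast_training_params(lora)            # trainable params -> fp32

print(next(lora.parameters()).dtype)  # torch.float32
print(next(base.parameters()).dtype)  # torch.float16
```

Only `.data` is recast, so optimizer state hookup and `requires_grad` flags are untouched; the frozen base model keeps its fp16 memory footprint.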

Logs

Steps:  95%|████████████████████ | 1600/1680 [00:05<?, ?it/s, loss=0.0168, lr=1]
Traceback (most recent call last):
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 2104, in <module>
    main(args)
  File "/workspace/train_dreambooth_lora_sdxl_advanced.py", line 1861, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

System Info

  • diffusers version: 0.26.0.dev0

  • Platform: Linux-5.4.0-156-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Huggingface_hub version: 0.20.2
  • Transformers version: 4.36.2
  • Accelerate version: 0.25.0
  • xFormers version: 0.0.23.post1
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

Labels

bug (Something isn't working)