### Describe the bug
Both `examples/dreambooth/train_dreambooth_lora_sdxl.py` and `examples/advanced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py` appear to have an issue when resuming training from a previously saved checkpoint.
Training and saving checkpoints seem to work correctly; however, when resuming from a previously saved checkpoint, the following messages are produced at script startup:
```
Resuming from checkpoint checkpoint-10
12/27/2023 16:29:22 - INFO - accelerate.accelerator - Loading states from xqc/checkpoint-10
Loading unet.
12/27/2023 16:29:22 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
Loading text_encoder.
12/27/2023 16:29:23 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
```
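That warning is PEFT's generic message for adding an adapter to a model that already carries one. A minimal illustration (not the training script's actual code) of how a model ends up with two coexisting adapters, which would also explain the `_1`-suffixed weight names further below:

```python
# Minimal sketch: injecting a second LoRA adapter into a model that already
# has one leaves both adapters in place instead of restoring the first.
import torch.nn as nn
from peft import LoraConfig, get_peft_model

base = nn.Sequential(nn.Linear(8, 8))
config = LoraConfig(r=4, target_modules=["0"])  # "0" = the Linear's module name

model = get_peft_model(base, config)    # first adapter ("default")
model.add_adapter("default_0", config)  # second adapter; both now coexist
print(list(model.peft_config))          # ['default', 'default_0']
```

If something like this happens on resume, the suffixes suggest the resume path re-adds LoRA layers rather than loading weights into the existing ones.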
Training appears to continue normally; however, all new checkpoints saved after this point are significantly larger than the previous ones:
```
(xl) localhost /media/nvme/xl/diffusers/examples/dreambooth # du -sch xqc/*
87M xqc/checkpoint-10
110M xqc/checkpoint-15
110M xqc/checkpoint-20
110M xqc/checkpoint-25
87M xqc/checkpoint-5
88K xqc/logs
494M total
```
Once training resumed from a checkpoint completes, a large dump of layer names is printed along with a message saying that the model contains layers that do not match (full error message below).
To me, this looks like the checkpoint is being loaded incorrectly and ignored: a new adapter is trained from scratch, and then both versions, old and new, are saved in the final LoRA.
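This hypothesis can be checked by diffing the keys stored in a pre-resume vs. a post-resume checkpoint. A rough sketch, assuming each `checkpoint-*` directory contains a `pytorch_lora_weights.safetensors` file (adjust paths/names if yours differ):

```python
# Rough diagnostic sketch: compare LoRA keys between a pre-resume and a
# post-resume checkpoint. If a second adapter was added on resume, the
# post-resume file should contain extra "_1"-suffixed keys.
from safetensors.torch import load_file

before = load_file("xqc/checkpoint-10/pytorch_lora_weights.safetensors")
after = load_file("xqc/checkpoint-15/pytorch_lora_weights.safetensors")

extra = sorted(set(after) - set(before))
print(f"{len(before)} keys before resume, {len(after)} after")
print("sample of extra keys:", extra[:5])
```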
### Reproduction
To reproduce this issue, follow these steps:
- Run either `train_dreambooth_lora_sdxl*.py` script with appropriate parameters, including `--checkpointing_steps` (preferably set to a low number to reproduce this issue quickly).
- After at least one or two checkpoints have been saved, either stop the script or wait for it to complete.
- Rerun the same script, but also include `--resume_from_checkpoint latest` or `--resume_from_checkpoint checkpoint-x`.
- Observe the effects listed above (PEFT warning message on startup, larger sizes of later checkpoint files).
- After resumed training is completed, attempt to load the finished LoRA (inference will succeed, but LoRA performance does not seem correct).
- Observe the error message produced.
### Logs
My full command line with all arguments looks like this:

```
python train_dreambooth_lora_sdxl.py --pretrained_model_name_or_path ../../../models/colossus_v5.3 --instance_data_dir /media/nvme/datasets/combined/ --output_dir xqc --resolution 1024 --instance_prompt 'a photo of hxq' --train_text_encoder --num_train_epochs 1 --train_batch_size 1 --gradient_checkpointing --checkpointing_steps 5 --gradient_accumulation_steps 1 --learning_rate 0.0001 --resume_from_checkpoint latest
```

Error produced during inference with the affected LoRA (truncated because of length):

```
Loading adapter weights from state_dict led to unexpected keys not found in the model: ['down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn1.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.0.transformer_blocks.1.attn2.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v.lora_B_1.default_0.weight', 
'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_q.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_q.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_k.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_k.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_v.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_v.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_out.0.lora_A_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn1.to_out.0.lora_B_1.default_0.weight', 'down_blocks.1.attentions.1.transformer_blocks.1.attn2.to_q.lora_A_1.default_0.weight',
*** TRUNCATED HERE ***
'mid_block.attentions.0.transformer_blocks.8.attn1.to_k.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn1.to_k.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn1.to_v.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn1.to_v.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn1.to_out.0.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn1.to_out.0.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_q.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_q.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_k.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_k.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_v.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_v.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_out.0.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.8.attn2.to_out.0.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_q.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_q.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_k.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_k.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_v.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_v.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_out.0.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn1.to_out.0.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_q.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_q.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_k.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_k.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_v.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_v.lora_B_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_out.0.lora_A_1.default_0.weight', 'mid_block.attentions.0.transformer_blocks.9.attn2.to_out.0.lora_B_1.default_0.weight'].
Loading adapter weights from None led to unexpected keys not found in the model: ['text_model.encoder.layers.0.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.0.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.1.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.2.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.3.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.4.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.q_proj.lora_B_1.default_0.weight', 
'text_model.encoder.layers.5.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.5.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.6.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.7.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.8.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.9.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.q_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.10.self_attn.out_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.k_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.k_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.v_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.v_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.q_proj.lora_A_1.default_0.weight', 
'text_model.encoder.layers.11.self_attn.q_proj.lora_B_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.out_proj.lora_A_1.default_0.weight', 'text_model.encoder.layers.11.self_attn.out_proj.lora_B_1.default_0.weight'].
```
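If a usable file is needed before this is fixed, a possible (unverified) workaround is to strip the duplicated adapter out of the final LoRA. A sketch, assuming the duplicates are exactly the keys containing `lora_A_1`/`lora_B_1` as in the error above, and that the script's default output filename `pytorch_lora_weights.safetensors` is used:

```python
# Unverified workaround sketch: drop the "_1"-suffixed duplicate adapter
# weights from the final LoRA file. Back up the original file first; key
# names in the saved file may differ from the ones reported at load time.
from safetensors.torch import load_file, save_file

state_dict = load_file("xqc/pytorch_lora_weights.safetensors")
cleaned = {
    k: v
    for k, v in state_dict.items()
    if "lora_A_1" not in k and "lora_B_1" not in k
}
print(f"dropped {len(state_dict) - len(cleaned)} duplicate keys")
save_file(cleaned, "xqc/pytorch_lora_weights_cleaned.safetensors")
```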
### System Info
Latest diffusers - master branch pulled on 2023/12/27.
OS - Linux 6.1.9
```
(xl) localhost /media/nvme/xl # uname -a
Linux localhost 6.1.9-noinitramfs #4 SMP PREEMPT_DYNAMIC Fri Feb 10 03:01:14 -00 2023 x86_64 Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz GenuineIntel GNU/Linux
```
python - Python 3.10.9
```
(xl) localhost /media/nvme/xl # diffusers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `diffusers` version: 0.25.0.dev0
- Platform: Linux-6.1.9-noinitramfs-x86_64-Intel-R-_Core-TM-i5-9500T_CPU@_2.20GHz-with-glibc2.36
- Python version: 3.10.9
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Huggingface_hub version: 0.20.1
- Transformers version: 4.36.2
- Accelerate version: 0.23.0
- xFormers version: 0.0.23.post1
- Using GPU in script?: No (however, I believe it will occur on GPU as well)
- Using distributed or parallel set-up in script?: No
```
### Who can help?
_No response_