Closed
Labels
bug: Something isn't working
Description
Describe the bug
I'm having issues with DreamBooth, but only locally: the model doesn't seem to learn the instance. However, it does work in Colab (this one).
Reproduction
Since I can't run the example on my GPU (not enough VRAM), I've tried to run it on the CPU with this command:
accelerate launch --cpu train_dreambooth.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" --use_auth_token \
--instance_data_dir="images" \
--class_data_dir="class-images" \
--output_dir="model" \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="photo of sks woman" \
--class_prompt="photo of a woman" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--sample_batch_size=2
I've also tried to run ShivamShrirao's fork (specifically, using this dockerized version), which does fit on the GPU with --use_8bit_adam and --gradient_checkpointing:
docker run -it --gpus=all -v (pwd):/train -e HUGGING_FACE_HUB_TOKEN=(cat ~/.huggingface/token) smy20011/dreambooth:latest \
accelerate launch /train_dreambooth.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" --use_auth_token \
--instance_data_dir="images" \
--class_data_dir="class-images" \
--output_dir="model" \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="photo of sks woman" \
--class_prompt="photo of a woman" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--sample_batch_size=2 \
--mixed_precision="fp16"
In both cases, the results were the same: the model went through 800 steps of training and didn't seem to learn the instance.
However, the notebook version from ShivamShrirao's fork did learn the instance when I ran it on Colab.
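For context on the `--with_prior_preservation --prior_loss_weight=1.0` flags used in both commands: as far as I understand the diffusers DreamBooth example, each batch stacks the instance examples followed by an equal number of class examples, and the loss is the instance MSE plus `prior_loss_weight` times the class (prior) MSE. The framework-free sketch below only illustrates that weighting under this assumption; `mse` and `prior_preservation_loss` are hypothetical helper names, plain lists stand in for tensors, and this is not the script's actual code.

```python
from typing import List

Vector = List[float]


def mse(preds: List[Vector], targets: List[Vector]) -> float:
    """Mean squared error over every element of every sample."""
    squared = [(p - t) ** 2 for pv, tv in zip(preds, targets) for p, t in zip(pv, tv)]
    return sum(squared) / len(squared)


def prior_preservation_loss(noise_pred: List[Vector], noise: List[Vector],
                            prior_loss_weight: float = 1.0) -> float:
    """Assumed layout: first half of the batch = instance samples, second half = class samples."""
    half = len(noise_pred) // 2
    instance_loss = mse(noise_pred[:half], noise[:half])  # learns the "sks woman" instance
    prior_loss = mse(noise_pred[half:], noise[half:])     # keeps the generic "woman" class prior
    return instance_loss + prior_loss_weight * prior_loss


# With --train_batch_size=1, a batch would hold one instance and one class sample
# (hypothetical predicted/true noise values, flattened to short vectors).
noise_pred = [[0.1, 0.2], [0.0, 0.5]]
noise = [[0.0, 0.0], [0.0, 0.0]]
# instance term 0.025 + 1.0 * prior term 0.125, i.e. roughly 0.15
print(prior_preservation_loss(noise_pred, noise, prior_loss_weight=1.0))
```

With `--prior_loss_weight=1.0`, the class term counts exactly as much as the instance term, so a run that only reduces the prior loss can still report a modest total loss without having learned the instance.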
Logs
The logs from the CPU run looked like this:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--num_cpu_threads_per_process` was set to `16` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Steps: 1%|█ | 6/800 [06:18<13:36:47, 61.72s/it, loss=0.212, lr=5e-6]
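The remaining time in that progress bar matches a simple per-step estimate, which is why a full CPU run takes so long. A minimal sketch, assuming a constant 61.72 s per step (tqdm smooths its rate, so its displayed ETA can differ by a few seconds); `estimate_remaining` and `format_hms` are hypothetical helpers:

```python
def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, similar to tqdm's time fields."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def estimate_remaining(done: int, total: int, seconds_per_step: float) -> float:
    """Seconds left if every remaining step takes `seconds_per_step`."""
    return (total - done) * seconds_per_step


# Numbers taken from the progress bar: 6 of 800 steps done at about 61.72 s/it.
remaining = estimate_remaining(done=6, total=800, seconds_per_step=61.72)
full_run = 800 * 61.72

print(format_hms(remaining))       # about 13.6 hours still to go
print(f"{full_run / 3600:.1f} h")  # about 13.7 hours for all 800 steps
```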
(I don't have the original logs, but they looked exactly like the excerpt above, with no errors or warnings. Reproducing the logs of a full run will take me several hours, but I can do that if necessary.)
System Info
- diffusers version: 0.4.0.dev0
- Platform: Linux-5.13.0-44-generic-x86_64-with-glibc2.34
- Python version: 3.9.7
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Huggingface_hub version: 0.10.0
- Transformers version: 4.22.2
- Using GPU in script?: No for the original script; yes for the fork.
- Using distributed or parallel set-up in script?: No
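Fields like these can be gathered with a short standard-library script. This is a minimal sketch using `importlib.metadata` and `platform`; `env_report` and `package_version` are hypothetical helpers, and the package list also includes `accelerate` and `bitsandbytes` (the latter backs `--use_8bit_adam`), which the commands above rely on but the report above doesn't list.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def package_version(name: str) -> str:
    """Installed version of a distribution, or a marker if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


def env_report(packages=("diffusers", "torch", "huggingface_hub",
                         "transformers", "accelerate", "bitsandbytes")) -> str:
    """Build a bullet list in the same style as the System Info section."""
    lines = [
        f"- Platform: {platform.platform()}",
        f"- Python version: {platform.python_version()}",
    ]
    lines += [f"- {name} version: {package_version(name)}" for name in packages]
    return "\n".join(lines)


print(env_report())
```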