Fix runtime error in dreambooth training script #6282

andrewssdd · 2023-12-22T00:37:52Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

This is a fix to a training example. So: @sayakpaul and @patrickvonplaten

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2023-12-22T00:45:47Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul · 2023-12-22T01:37:35Z

@williamberman could you take a look here?

patrickvonplaten · 2023-12-26T20:19:49Z

Thanks for the PR @ctawong. Can you explain a bit why the change is needed here? I don't understand #5932, as it's very messy, and I'm generally not a big fan of adding try-except statements.

andrewssdd · 2023-12-26T21:12:53Z

The script errors out when the model to train from does not have a VAE. The original code attempted to handle it but failed.

The same bug was in the LoRA training script in the same folder and was fixed by PR #3462.

I don’t like the try/except either, but I applied the same fix as in #3462 for consistency.
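
The fallback pattern described here can be sketched roughly as follows. This is a minimal illustration, not the code from either PR: `load_optional_vae` and the two stand-in loaders are hypothetical names, and it is assumed that a missing VAE surfaces as an `OSError` from the loading call.

```python
from typing import Callable, Optional


def load_optional_vae(loader: Callable[[], object]) -> Optional[object]:
    """Return the loaded VAE, or None when the checkpoint ships without one.

    Assumption: the loader signals a missing VAE subfolder with OSError.
    """
    try:
        return loader()
    except OSError:
        # No VAE available (assumed case: a pixel-space model); train without it.
        return None


def loader_with_vae() -> object:
    # Hypothetical stand-in for a real loading call; returns a dummy object.
    return {"name": "vae"}


def loader_without_vae() -> object:
    # Hypothetical stand-in that mimics a missing "vae" subfolder.
    raise OSError("vae/config.json not found")


print(load_optional_vae(loader_with_vae))
print(load_optional_vae(loader_without_vae))
```

The trade-off is that a broad `except` can also hide unrelated loading failures, which is one reason to catch only the narrowest exception that applies.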

patrickvonplaten · 2023-12-26T21:45:49Z

The training script is meant for Stable Diffusion models, which always have a VAE, no? I think it'd be better to raise a nice error here instead.
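
For comparison, the "raise a nice error" alternative could look something like the sketch below. It is only an illustration under stated assumptions: `require_vae` is a hypothetical helper, the repository file names are passed in as plain strings, and the VAE config is assumed to live at `vae/config.json`.

```python
from typing import Iterable

# Assumed location of the VAE config inside a model repository.
VAE_CONFIG_PATH = "vae/config.json"


def require_vae(repo_files: Iterable[str], model_name: str) -> None:
    """Raise a descriptive error when no VAE config is listed for the model."""
    if VAE_CONFIG_PATH not in set(repo_files):
        raise ValueError(
            f"{model_name!r} has no {VAE_CONFIG_PATH!r}; "
            "this sketch expects a checkpoint that includes a VAE."
        )


# Passes silently when the VAE config is present.
require_vae(["unet/config.json", "vae/config.json"], "example/model")

# Raises ValueError with a readable message when it is missing.
try:
    require_vae(["unet/config.json"], "example/model")
except ValueError as err:
    print(err)
```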

andrewssdd · 2023-12-27T00:57:49Z

The training script is meant for Stable Diffusion models with or without a VAE.

Models without a VAE are supported; it is only the VAE check that fails. This PR fixes that check.

andrewssdd · 2024-01-05T00:10:19Z

Can I merge this?

sayakpaul · 2024-01-05T01:39:06Z

"The training script is meant for Stable Diffusion models with or without a VAE."

That is not true. This training script can be used with DeepFloyd IF, too. Since this script was tested rigorously against a few combinations of models, I am afraid we won't be able to consider these changes.

andrewssdd · 2024-01-05T02:09:56Z

"this script was tested rigorously against a few combinations of models"

As it stands now, the script doesn't even train from the SD 1.5 model runwayml/stable-diffusion-v1-5.

sayakpaul · 2024-01-05T02:11:20Z

The fast tests don't seem to tell me that. They run fine. For what combination of CLI args does the script fail to run?

andrewssdd · 2024-01-05T02:16:09Z

Tested on Windows

$MODEL_NAME="runwayml/stable-diffusion-v1-5"
$INSTANCE_DIR="dog"
$OUTPUT_DIR="dreambooth/model"

accelerate launch .\examples\dreambooth\train_dreambooth.py `
  --pretrained_model_name_or_path=$MODEL_NAME  `
  --instance_data_dir=$INSTANCE_DIR `
  --output_dir=$OUTPUT_DIR `
  --instance_prompt="a photo of sks dog" `
  --resolution=512 `
  --train_batch_size=1 `
  --gradient_accumulation_steps=1 `
  --learning_rate=5e-6 `
  --lr_scheduler="constant" `
  --lr_warmup_steps=0 `
  --max_train_steps=400

Error message:

Steps:   0%|                                                                                   | 0/400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\XXXXXXX\git\diffusers\examples\dreambooth\train_dreambooth.py", line 1428, in <module>
    main(args)
  File "C:\Users\XXXXXXX\git\diffusers\examples\dreambooth\train_dreambooth.py", line 1258, in main
    model_pred = unet(
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\utils\operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\utils\operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\src\diffusers\models\unet_2d_condition.py", line 1072, in forward
    sample = self.conv_in(sample)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [320, 4, 3, 3], expected input[1, 3, 512, 512] to have 4 channels, but got 3 channels instead
Steps:   0%|                                                                                   | 0/400 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\XXXXXXX\git\diffusers\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\XXXXXXX\\git\\diffusers\\venv\\Scripts\\python.exe', '.\\examples\\dreambooth\\train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=dog', '--output_dir=dreambooth/model', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400']' returned non-zero exit status 1.

Training succeeds with this fix, which I took from the LoRA training script, which shares much of its code with this script.
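
The `RuntimeError` in the traceback can be read as a shape mismatch: the UNet's first convolution has a weight of size [320, 4, 3, 3], so it expects 4 input channels, but it received a 3-channel 512x512 image. That is what would happen if the VAE encoding step were skipped. The sketch below only models the shapes; the function names are hypothetical, and the downsampling factor of 8 is an assumption about the VAE rather than something stated in this thread.

```python
from typing import Tuple

LATENT_CHANNELS = 4  # matches the conv weight [320, 4, 3, 3] in the error
PIXEL_CHANNELS = 3   # RGB input, as in input[1, 3, 512, 512]
VAE_SCALE = 8        # assumed spatial downsampling factor of the VAE

Shape = Tuple[int, int, int, int]


def unet_input_shape(batch: int, resolution: int, use_vae: bool) -> Shape:
    """Shape of the tensor handed to the UNet, with or without VAE encoding."""
    if use_vae:
        side = resolution // VAE_SCALE
        return (batch, LATENT_CHANNELS, side, side)
    return (batch, PIXEL_CHANNELS, resolution, resolution)


def check_channels(shape: Shape, expected_channels: int) -> None:
    """Mimic the channel check that the first convolution performs."""
    if shape[1] != expected_channels:
        raise RuntimeError(
            f"expected input{list(shape)} to have {expected_channels} channels, "
            f"but got {shape[1]} channels instead"
        )


# With the VAE applied, the latent shape fits a 4-channel UNet.
check_channels(unet_input_shape(1, 512, use_vae=True), LATENT_CHANNELS)

# With the VAE skipped, raw pixels reproduce the reported mismatch.
try:
    check_channels(unet_input_shape(1, 512, use_vae=False), LATENT_CHANNELS)
except RuntimeError as err:
    print(err)
```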

sayakpaul · 2024-01-05T02:18:28Z

Can you try updating huggingface_hub and see if that solves your problem? Essentially what you're doing can be accomplished by the use of model_info from the Hugging Face Hub library. I would try to debug that in isolation and see if that's giving expected outputs.
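
One way to debug the check in isolation is to separate the file listing from the matching logic. The sketch below is an illustration under assumptions, not the script's code: `repo_has_vae` is a hypothetical helper, and the file list is hard-coded where a real check would use the file names returned by `model_info`. Because the failure was reported on Windows, it also shows one pitfall worth ruling out, which is not confirmed anywhere in this thread: joining the expected path with Windows path rules yields a backslash, which would not match forward-slash repository paths.

```python
import ntpath
import posixpath
from typing import Iterable


def repo_has_vae(repo_files: Iterable[str], config_path: str) -> bool:
    """Return True if the expected VAE config path appears in the file list."""
    return any(name == config_path for name in repo_files)


# Hard-coded stand-in for a repository file listing (forward slashes).
files = ["model_index.json", "unet/config.json", "vae/config.json"]

# posixpath and ntpath apply POSIX and Windows join rules on any platform.
posix_style = posixpath.join("vae", "config.json")
windows_style = ntpath.join("vae", "config.json")

print(repr(posix_style), repo_has_vae(files, posix_style))
print(repr(windows_style), repo_has_vae(files, windows_style))
```

If the two calls disagree on the machine where the script fails, the path construction rather than the Hub response would be the place to look.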

andrewssdd · 2024-01-05T02:25:55Z

Hi, it is already on the latest version, huggingface-hub 0.20.1.

sayakpaul · 2024-01-05T02:26:52Z

Then this still stands:

Essentially what you're doing can be accomplished by the use of model_info from the Hugging Face Hub library. I would try to debug that in isolation and see if that's giving expected outputs.

github-actions · 2024-01-29T15:05:06Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

fix runtime error number of channel mismatch

38a8416

sayakpaul requested a review from williamberman December 22, 2023 01:37

github-actions bot added the stale Issues that haven't received updates label Jan 29, 2024

github-actions bot closed this Feb 6, 2024
