@andrewssdd andrewssdd commented Dec 22, 2023

What does this PR do?

Fixes #5932

Before submitting

Who can review?

This is a fix to a training example. So: @sayakpaul and @patrickvonplaten

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

@williamberman could you take a look here?

@patrickvonplaten
Contributor

Thanks for the PR @ctawong. Can you explain a bit why the change is needed here? I don't follow #5932, as it's very messy, and I'm generally not a big fan of adding try/except statements.

@andrewssdd
Author

The script errors out when the model to train from does not have a VAE. The original code attempted to handle it but failed.

The same bug was in the LoRA training script in the same folder, and it was fixed by PR #3462.

I don't like the try/except either, but I applied the same fix as in #3462 for consistency.

@patrickvonplaten
Contributor

The training script is meant for Stable Diffusion models, which always have a VAE, no? I think it'd be better to raise a nice error here instead.

@andrewssdd
Author

The training script is meant for Stable Diffusion models with or without a VAE.

Models without a VAE are supported; it is the VAE check that fails. This PR fixes the VAE check.

@andrewssdd
Author

Can I merge this?

@sayakpaul
Member

"The training script is meant for Stable Diffusion models with or without a VAE."

That is not true. This training script can be used with DeepFloyd IF, too. Since this script was tested rigorously against a few combinations of models, I am afraid we won't be able to consider these changes.

@andrewssdd
Author

"this script was tested rigorously against a few combinations of models"

As it stands now, the script doesn't even train from the SD 1.5 model runwayml/stable-diffusion-v1-5.

@sayakpaul
Member

The fast tests don't seem to tell me that. They run fine. For which combination of CLI args does the script fail?

@andrewssdd
Author

Tested on Windows

$MODEL_NAME="runwayml/stable-diffusion-v1-5"
$INSTANCE_DIR="dog"
$OUTPUT_DIR="dreambooth/model"

accelerate launch .\examples\dreambooth\train_dreambooth.py `
  --pretrained_model_name_or_path=$MODEL_NAME  `
  --instance_data_dir=$INSTANCE_DIR `
  --output_dir=$OUTPUT_DIR `
  --instance_prompt="a photo of sks dog" `
  --resolution=512 `
  --train_batch_size=1 `
  --gradient_accumulation_steps=1 `
  --learning_rate=5e-6 `
  --lr_scheduler="constant" `
  --lr_warmup_steps=0 `
  --max_train_steps=400

Error message:

Steps:   0%|                                                                                   | 0/400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\XXXXXXX\git\diffusers\examples\dreambooth\train_dreambooth.py", line 1428, in <module>
    main(args)
  File "C:\Users\XXXXXXX\git\diffusers\examples\dreambooth\train_dreambooth.py", line 1258, in main
    model_pred = unet(
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\utils\operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\utils\operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\src\diffusers\models\unet_2d_condition.py", line 1072, in forward
    sample = self.conv_in(sample)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [320, 4, 3, 3], expected input[1, 3, 512, 512] to have 4 channels, but got 3 channels instead
Steps:   0%|                                                                                   | 0/400 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\XXXXXXX\git\diffusers\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\XXXXXXX\git\diffusers\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\XXXXXXX\\git\\diffusers\\venv\\Scripts\\python.exe', '.\\examples\\dreambooth\\train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=dog', '--output_dir=dreambooth/model', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=400']' returned non-zero exit status 1.
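
The RuntimeError above is a channel mismatch: the UNet's first convolution has weights of shape [320, 4, 3, 3], so it expects 4-channel latents, but it received a raw 3-channel image batch. A minimal shape-only sketch of that check follows. It assumes the usual Stable Diffusion 1.5 layout (a VAE that downsamples by 8 and emits 4 latent channels); the helper names are hypothetical and are not taken from the training script.

```python
# Pure-Python shape check; no torch required.
# Assumption: the SD 1.5 VAE downsamples by 8 and emits 4 latent channels.
VAE_SCALE_FACTOR = 8
LATENT_CHANNELS = 4


def conv_in_accepts(weight_shape, input_shape, groups=1):
    """True if a conv2d weight [out, in/groups, kH, kW] accepts input [N, C, H, W]."""
    expected_channels = weight_shape[1] * groups
    return input_shape[1] == expected_channels


def latent_shape(pixel_shape):
    """Shape of the latents a VAE encoder would produce for pixels [N, 3, H, W]."""
    n, _, h, w = pixel_shape
    return (n, LATENT_CHANNELS, h // VAE_SCALE_FACTOR, w // VAE_SCALE_FACTOR)


unet_conv_in_weight = (320, 4, 3, 3)  # taken from the error message
pixel_batch = (1, 3, 512, 512)        # taken from the error message

# Raw pixels are rejected; VAE-encoded latents would be accepted.
raw_ok = conv_in_accepts(unet_conv_in_weight, pixel_batch)
latent_ok = conv_in_accepts(unet_conv_in_weight, latent_shape(pixel_batch))
print(raw_ok, latent_ok)
```

In other words, the traceback is consistent with the VAE encoding step having been skipped for a model that does ship a VAE.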

Training succeeds with this fix, which I took from the LoRA training script; the two scripts share much of their code.
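
For readers who have not seen #3462, the try/except approach discussed in this thread can be sketched roughly as below. This is an illustration and not the PR's actual diff: `load_optional` and the two stand-in loaders are hypothetical names, and the sketch assumes a missing component surfaces as an `OSError`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_optional(loader: Callable[[], T]) -> Optional[T]:
    """Call `loader`; return None if the component is missing (signalled by OSError)."""
    try:
        return loader()
    except OSError:
        # Component not present in the checkpoint; the caller can then
        # train directly in pixel space instead of erroring out.
        return None


# Hypothetical stand-ins for a real loader. In the training script the loader
# would presumably wrap something like
# AutoencoderKL.from_pretrained(model_name, subfolder="vae").
def _load_present() -> str:
    return "vae-weights"


def _load_missing() -> str:
    raise OSError("vae/config.json not found")


vae_present = load_optional(_load_present)
vae_missing = load_optional(_load_missing)
```

The trade-off raised in review still applies: catching `OSError` also hides unrelated I/O failures, which is one reason an explicit existence check may be preferred.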

@sayakpaul
Member

Can you try updating huggingface_hub and see if that solves your problem? Essentially what you're doing can be accomplished by the use of model_info from the Hugging Face Hub library. I would try to debug that in isolation and see if that's giving expected outputs.
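
One way to debug the check in isolation is sketched below. It tests a possibility this thread does not confirm: that the script builds the VAE config path with `os.path.join`, which on Windows yields a backslash-separated path, while Hub file listings use forward slashes. The file listing and the `repo_has_file` helper are hypothetical; `ntpath` is used only to reproduce Windows joining on any OS.

```python
import ntpath
import posixpath


def repo_has_file(repo_files, *parts):
    """Check a list of Hub-style filenames (always '/'-separated) for a path."""
    target = posixpath.join(*parts)
    return target in repo_files


# Hypothetical listing. A real one could be obtained with huggingface_hub:
#   [s.rfilename for s in model_info(repo_id).siblings]
repo_files = ["model_index.json", "unet/config.json", "vae/config.json"]

# What os.path.join("vae", "config.json") produces on Windows.
windows_style = ntpath.join("vae", "config.json")

has_vae_naive = windows_style in repo_files                       # separator mismatch
has_vae_posix = repo_has_file(repo_files, "vae", "config.json")  # separator-safe
```

If the naive comparison is what runs on Windows, the check would report no VAE even for a repository that contains one.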

@andrewssdd
Author

andrewssdd commented Jan 5, 2024 via email

@sayakpaul
Member

Then this still stands:

"Essentially what you're doing can be accomplished by the use of model_info from the Hugging Face Hub library. I would try to debug that in isolation and see if that's giving expected outputs."

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 29, 2024
@github-actions github-actions bot closed this Feb 6, 2024

Development

Successfully merging this pull request may close these issues.

RuntimeError: Given groups=1, weight of size [320, 4, 3, 3], expected input[1, 3, 512, 512] to have 4 channels, but got 3 channels instead

4 participants