`--loraplus_ratio` added for both TE and UNet. Added logging for LoRA+.
When trying to load stored latents, if an error occurs, this change will tell you which file failed to load. Currently it just tells you that something failed, without telling you which file.
This can be used to train away from a group of images you don't want. Because this moves the model away from a point instead of towards it, the change in the model is unbounded, so don't set it too low; -4e-7 seemed to work well.
If a latent file fails to load, print the path and the error, then return False so the latent is regenerated.
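The intended behavior can be sketched roughly as follows (a minimal illustration only; the function name and the use of pickle as the on-disk format are assumptions, not the actual sd-scripts code):

```python
import pickle


def try_load_latents(path):
    """Attempt to load a cached latents file.

    Returns the loaded object on success. On any failure, print the
    offending path together with the error and return None so the
    caller can regenerate that cache entry.
    """
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except Exception as e:
        # Report *which* file failed, not just that something failed.
        print(f"failed to load latents file: {path}: {e}")
        return None
```

A caller would treat a `None` result as "cache miss, regenerate" rather than aborting the run.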
Add LoRA+ support
Adafactor fused backward pass and optimizer step: lowers SDXL (at 1024 resolution) VRAM usage to 10 GB (BF16) / 16.4 GB (FP32)
Bug fix: alpha_mask load
Make timesteps work in the standard way when Huber loss is used
New optimizers: AdEMAMix8bit and PagedAdEMAMix8bit
1) Updates debiased estimation loss function for V-pred. 2) Prevents now-deprecated scaling of loss if ztSNR is enabled.
Different model architectures, such as SDXL, can take advantage of v-pred. It doesn't make sense to include these warnings anymore.
Update debiased estimation loss function to accommodate V-pred
Remove v-pred warnings
Impressive update, thanks to all the contributors!
@kohya-ss amazing work! I tested the fused backward pass on SDXL with Adafactor and it reduced VRAM usage to as low as 10200 MB. I also tried fused optimizer groups = 10 and it was around 10500 MB. However, when enabling the fused backward pass together with block swaps, it didn't make any further difference. Can I reduce VRAM usage any further for SDXL training at 1024x1024? I can train FLUX dev below 8 GB GPUs with block swaps.
**Important:** The dependent libraries are updated. Please see Upgrade and update the libraries.
- Fixed a bug where the loss weight was incorrect when `--debiased_estimation_loss` was specified with `--v_parameterization`. PR #1715 Thanks to catboxanon! See the PR for details.
- The warning is removed when `--v_parameterization` is specified in SDXL and SD1.5. PR #1717
- There was a bug where `min_bucket_reso`/`max_bucket_reso` in the dataset configuration did not create the correct resolution bucket if they were not divisible by `bucket_reso_steps`. These values now trigger a warning and are automatically rounded to a divisible value. Thanks to Maru-mee for raising the issue. Related PR #1632
- `bitsandbytes` is updated to 0.44.0. Now you can use `AdEMAMix8bit` and `PagedAdEMAMix8bit` in the training script. PR #1640 Thanks to sdbds! Specify `--optimizer_type bitsandbytes.optim.AdEMAMix8bit` (not `bnb` but `bitsandbytes`).
- Fixed a bug in the cache of latents: when `flip_aug`, `alpha_mask`, and `random_crop` differed between multiple subsets in the dataset configuration file (`.toml`), the last subset's settings were used instead of each subset's own.
- Fixed an issue where the timesteps in the batch were all the same when using Huber loss. PR #1628 Thanks to recris!
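The Huber-loss fix (PR #1628) boils down to drawing an independent timestep for every sample in the batch instead of one value broadcast to the whole batch. A minimal pure-Python sketch of the idea (the function name is illustrative, not the actual sd-scripts API):

```python
import random


def sample_timesteps(batch_size, num_train_timesteps, rng=random):
    """Draw an independent timestep for each sample in the batch.

    The pre-fix behaviour effectively shared a single timestep across
    the batch, which skews Huber-loss scheduling; sampling one per
    element restores the standard behaviour.
    """
    return [rng.randrange(num_train_timesteps) for _ in range(batch_size)]
```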
Improvements in OFT (Orthogonal Finetuning) Implementation
These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability.
Additional Information
Recommended α value for OFT constraint: we recommend α values between 1e-4 and 1e-2. This differs slightly from the original implementation, which scales the constraint as `α * out_dim * out_dim`; our implementation uses `α * out_dim`, hence we recommend higher values than the 1e-5 suggested in the original implementation.
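The difference in recommended α comes purely from the constraint scaling. A tiny sketch comparing the two scalings (illustrative only; the function names are not part of either implementation):

```python
def oft_constraint_original(alpha, out_dim):
    # Original implementation: constraint scales with out_dim squared.
    return alpha * out_dim * out_dim


def oft_constraint_here(alpha, out_dim):
    # This implementation: constraint scales linearly with out_dim,
    # hence the recommended alpha is roughly out_dim times larger.
    return alpha * out_dim
```

For example, with `out_dim = 1280`, the original's suggested α = 1e-5 yields the same constraint as α = 1.28e-2 under the linear scaling, consistent with the recommended 1e-4 to 1e-2 range.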
Performance Improvement: Training speed has been improved by approximately 30%.
Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).
The INVERSE_SQRT, COSINE_WITH_MIN_LR, and WARMUP_STABLE_DECAY learning rate schedules are now available in the transformers library. See PR #1393 for details. Thanks to sdbds!
- `--lr_warmup_steps` and `--lr_decay_steps` can now be specified as a ratio of the number of training steps, not just the step value. Example: `--lr_warmup_steps=0.1` or `--lr_warmup_steps=10%`, etc.
- When enlarging images in the script (when the training image is small and `bucket_no_upscale` is not specified), Pillow's resize with LANCZOS interpolation is now used instead of OpenCV2's resize with Lanczos4 interpolation. The quality of the image enlargement may be slightly improved. PR #1426 Thanks to sdbds!
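The ratio form of `--lr_warmup_steps` can be interpreted as follows (a hypothetical parser sketch matching the documented examples, not the actual implementation):

```python
def resolve_steps(value, total_steps):
    """Resolve a step count given as an int, a float ratio, or 'N%'.

    '10%' and 0.1 both mean 10% of total_steps; values >= 1 are taken
    as literal step counts.
    """
    if isinstance(value, str) and value.endswith("%"):
        return int(total_steps * float(value[:-1]) / 100)
    v = float(value)
    if v < 1:
        return int(total_steps * v)
    return int(v)
```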
Sample image generation during training now works on non-CUDA devices. PR #1433 Thanks to millie-v!
- `--v_parameterization` is available in `sdxl_train.py`. The results are unpredictable, so use with caution. PR #1505 Thanks to liesened!
- Fused optimizer is available for SDXL training. PR #1259 Thanks to 2kpr!
  - Specify the `--fused_backward_pass` option in `sdxl_train.py`. At this time, only AdaFactor is supported. Gradient accumulation is not available.
  - Setting mixed precision to `no` seems to use less memory than `fp16` or `bf16`.
  - If you specify the `--full_bf16` option, you can further reduce the memory usage (but the accuracy will be lower). With the same memory usage as before, you can increase the batch size.
  - Implemented using `Tensor.register_post_accumulate_grad_hook(hook)`.
- Optimizer groups feature is added to SDXL training. PR #1319
- Specify `--fused_optimizer_groups 10` in `sdxl_train.py`. Increasing the number of groups reduces memory usage but slows down training. Since the effect is limited to a certain number of groups, specifying 4-10 is recommended.
- `--fused_optimizer_groups` cannot be used with `--fused_backward_pass`. When using AdaFactor, the memory usage is slightly larger than with the fused optimizer. PyTorch 2.1 or later is required.
- LoRA+ is supported. PR #1233 Thanks to rockerBOO!
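The two memory-saving mechanisms above can be caricatured in a dependency-free sketch: the fused backward pass steps each parameter as soon as its gradient is accumulated (what `Tensor.register_post_accumulate_grad_hook` enables in PyTorch) and frees that gradient immediately, while optimizer groups step one partition of parameters at a time. This is an illustrative simulation with plain SGD standing in for AdaFactor, not the sd-scripts implementation:

```python
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = None  # gradient storage, freed as early as possible


def fused_backward_step(params, grads, lr):
    """Fused backward pass: step each parameter the moment its gradient
    arrives, then free the gradient, so at most one gradient is alive."""
    for p, g in zip(params, grads):
        p.grad = g
        p.value -= lr * p.grad  # per-parameter optimizer step
        p.grad = None           # free immediately
    return params


def grouped_step(params, grads, lr, n_groups):
    """Optimizer groups: partition parameters into n_groups and step/free
    one whole group at a time -- a middle ground between per-parameter
    stepping and a single global optimizer step."""
    group_size = max(1, -(-len(params) // n_groups))  # ceil division
    pairs = list(zip(params, grads))
    for start in range(0, len(pairs), group_size):
        group = pairs[start:start + group_size]
        for p, g in group:   # accumulate this group's gradients
            p.grad = g
        for p, _ in group:   # step and free the whole group at once
            p.value -= lr * p.grad
            p.grad = None
    return params
```

More groups means a smaller peak of live gradients but more optimizer invocations, which matches the observed memory/speed trade-off.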
- Specify `loraplus_lr_ratio` with `--network_args`. Example: `--network_args "loraplus_lr_ratio=16"`
- `loraplus_unet_lr_ratio` and `loraplus_text_encoder_lr_ratio` can be specified separately for the U-Net and the Text Encoder. Example: `--network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` or `--network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` etc.
- For `network_module`, `networks.lora` and `networks.dylora` are available.
- The feature to use the transparency (alpha channel) of the image as a mask in the loss calculation has been added. PR #1223 Thanks to u-haru!
  - Specify the `--alpha_mask` option in the training script, or specify `alpha_mask = true` in the dataset configuration file.
- LoRA training in SDXL now supports block-wise learning rates and block-wise dim (rank). PR #1331
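Conceptually, the alpha-mask feature weights the per-pixel loss by the image's alpha channel, so fully transparent pixels contribute nothing. A minimal sketch with plain lists (illustrative, not the PR's code):

```python
def masked_mse(pred, target, alpha):
    """Mean squared error weighted by a per-pixel alpha mask in [0, 1].

    Transparent pixels (alpha == 0) are excluded from the loss; the
    result is normalised by the total mask weight.
    """
    weight = sum(alpha)
    if weight == 0:
        return 0.0
    return sum(a * (p - t) ** 2 for p, t, a in zip(pred, target, alpha)) / weight
```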
- Negative learning rates can now be specified during SDXL model training. PR #1277 Thanks to Cauldrath! Because the value starts with `-`, specify it with `=`, like `--learning_rate=-1e-7`.
- Training scripts can now output training settings to wandb or TensorBoard logs. Specify the `--log_config` option. PR #1285 Thanks to ccharest93, plucked, rockerBOO, and VelocityRa!
- The ControlNet training script `train_controlnet.py` for SD1.5/2.x was not working, but it has been fixed. PR #1284 Thanks to sdbds!
- `train_network.py` and `sdxl_train_network.py` now restore the order/position of data loading from the DataSet when resuming training. PR #1353 #1359 Thanks to KohakuBlueleaf!
  - Specify the `--skip_until_initial_step` option to skip data loading until the specified step. If not specified, data loading starts from the beginning of the DataSet (same as before).
  - When `--resume` is specified, the step saved in the state is used.
  - Specify the `--initial_step` or `--initial_epoch` option to skip data loading until the specified step or epoch. Use these options in conjunction with `--skip_until_initial_step`. These options can be used without `--resume` (use them when resuming training with `--network_weights`).
- An option `--disable_mmap_load_safetensors` is added to disable memory mapping when loading the model's .safetensors in SDXL. PR #1266 Thanks to Zovjsra!
  - It is available in `sdxl_train.py`, `sdxl_train_network.py`, `sdxl_train_textual_inversion.py`, and `sdxl_train_control_net_lllite.py`.
- When there is an error in the cached latents file on disk, the file name is now displayed. PR #1278 Thanks to Cauldrath!
- Fixed an error that occurred when specifying `--max_dataloader_n_workers` in `tag_images_by_wd14_tagger.py` when Onnx is not used. PR #1291 issue #1290 Thanks to frodo821!
- Fixed a bug where `caption_separator` could not be specified in a subset in the dataset settings `.toml` file. #1312 and #1313 Thanks to rockerBOO!
- Fixed a potential bug in ControlNet-LLLite training. PR #1322 Thanks to aria1th!
- Fixed some bugs when using DeepSpeed. Related #1247
- Added a prompt option `--f` to `gen_imgs.py` to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.