From 28f3e602b679b51cc0b7d2bf13103199964bdadd Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Fri, 17 Mar 2023 17:35:43 +0800 Subject: [PATCH 1/9] Update requirements.txt --- examples/images/diffusion/requirements.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/images/diffusion/requirements.txt b/examples/images/diffusion/requirements.txt index d0af35353b66..59d027fcf60f 100644 --- a/examples/images/diffusion/requirements.txt +++ b/examples/images/diffusion/requirements.txt @@ -1,10 +1,10 @@ albumentations==1.3.0 -opencv-python==4.6.0 +opencv-python==4.6.0.66 pudb==2019.2 prefetch_generator imageio==2.9.0 imageio-ffmpeg==0.4.2 -torchmetrics==0.6 +torchmetrics==0.7 omegaconf==2.1.1 test-tube>=0.7.5 streamlit>=0.73.1 From 8fac53a3f140ae432ed96c90ffc1aa0fe9e44468 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Fri, 17 Mar 2023 17:39:10 +0800 Subject: [PATCH 2/9] Update environment.yaml --- examples/images/diffusion/environment.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/images/diffusion/environment.yaml b/examples/images/diffusion/environment.yaml index 5164be72e556..f4b1bebd7fc8 100644 --- a/examples/images/diffusion/environment.yaml +++ b/examples/images/diffusion/environment.yaml @@ -3,7 +3,7 @@ channels: - pytorch - defaults dependencies: - - python=3.9.12 + - python=3.8.16 - pip=20.3 - cudatoolkit=11.3 - pytorch=1.12.1 From 38ebe705af38b8d6f21c95bab81d099ff93d8748 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Fri, 17 Mar 2023 18:26:36 +0800 Subject: [PATCH 3/9] Update README.md --- examples/images/diffusion/README.md | 41 +++++++++++++++++------------ 1 file changed, 24 insertions(+), 17 deletions(-) diff --git a/examples/images/diffusion/README.md b/examples/images/diffusion/README.md index cc57f6d54a8e..34e9dc1a7498 100644 --- 
a/examples/images/diffusion/README.md +++ b/examples/images/diffusion/README.md @@ -40,15 +40,14 @@ This project is in rapid development. ### Option #1: install from source #### Step 1: Requirements -A suitable [conda](https://conda.io/) environment named `ldm` can be created -and activated with: +To begin with, make sure your operating system has a CUDA version suitable for this training, i.e. CUDA 11.6/11.8. For your convenience, we have set up the rest of the packages here. You can create and activate a suitable [conda](https://conda.io/) environment named `ldm`: ``` conda env create -f environment.yaml conda activate ldm ``` -You can also update an existing [latent diffusion](https://github.com/CompVis/latent-diffusion) environment by running +You can also update an existing [latent diffusion](https://github.com/CompVis/latent-diffusion) environment by running: ``` conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch @@ -57,32 +56,38 @@ pip install transformers diffusers invisible-watermark #### Step 2: install lightning -Install Lightning version later than 2022.01.04. We suggest you install lightning from source. +Install a Lightning version newer than 2022.01.04. We suggest you install Lightning from source. Note that pip should install into the conda environment; check the active pip with `which pip` and make sure its path points into the conda environment. -##### From Source +##### From Source: ``` git clone https://github.com/Lightning-AI/lightning.git pip install -r requirements.txt python setup.py install ``` -##### From pip +##### From pip: ``` pip install pytorch-lightning ``` -#### Step 3:Install [Colossal-AI](https://colossalai.org/download/) From Our Official Website +#### Step 3: Install [Colossal-AI](https://colossalai.org/download/) From Our Official Website: -##### From pip +You can install the latest version (0.2.7) from our official website or from source. 
Note that the suitable version for this training is colossalai 0.2.5, which corresponds to torch 1.12.1. -For example, you can install v0.2.0 from our official website. +##### Download the suggested version for this training: + +``` +pip install colossalai=0.2.5 +``` + +##### Download the latest version from pip for the latest torch version: ``` pip install colossalai ``` -##### From source +##### From source: ``` git clone https://github.com/hpcaitech/ColossalAI.git @@ -92,10 +97,12 @@ cd ColossalAI CUDA_EXT=1 pip install . ``` -#### Step 3:Accelerate with flash attention by xformers(Optional) +#### Step 4: Accelerate with flash attention by xformers (Optional) + +Note that xformers will accelerate the training process at the cost of extra disk space. The suitable version of xformers for this training process is 0.0.12. You can install xformers directly via pip. For more release versions, check its official page: [xformers](https://pypi.org/project/xformers/) ``` -pip install xformers +pip install xformers==0.0.12 ``` ### Option #2: Use Docker @@ -174,8 +181,7 @@ you should change the `data.file_path` in the `config/train_colossalai.yaml` ## Training -We provide the script `train_colossalai.sh` to run the training task with colossalai, -and can also use `train_ddp.sh` to run the training task with ddp to compare. +We provide the script `train_colossalai.sh` to run the training task with Colossal-AI. We also support standard PyTorch DDP training: you can use `train_ddp.sh` to run the training task with DDP and compare the performance. 
In `train_colossalai.sh` the main command is: @@ -193,9 +199,10 @@ python main.py --logdir /tmp/ --train --base configs/train_colossalai.yaml --ckp You can change the training config in the yaml file -- devices: device number used for training, default 8 -- max_epochs: max training epochs, default 2 -- precision: the precision type used in training, default 16 (fp16), you must use fp16 if you want to apply colossalai +- devices: device number used for training, default = 8 +- max_epochs: max training epochs, default = 2 +- precision: the precision type used in training, default = 16 (fp16), you must use fp16 if you want to apply colossalai +- placement_policy: the training strategy supported by Colossal-AI, default = 'cuda', which loads all the parameters into CUDA memory. 'cpu' uses the CPU offload strategy, while 'auto' enables Gemini; both are featured by Colossal-AI. - more information about the configuration of ColossalAIStrategy can be found [here](https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html#colossal-ai) From 7b185e5ebcbf84076ea83963f4df41f4c086fb37 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Mon, 20 Mar 2023 11:46:28 +0800 Subject: [PATCH 4/9] Update environment.yaml --- examples/images/diffusion/environment.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/images/diffusion/environment.yaml b/examples/images/diffusion/environment.yaml index ec6bb8a532af..d1ec69c1a585 100644 --- a/examples/images/diffusion/environment.yaml +++ b/examples/images/diffusion/environment.yaml @@ -3,7 +3,7 @@ channels: - pytorch - defaults dependencies: - - python=3.8.16 + - python=3.9.12 - pip=20.3 - cudatoolkit=11.3 - pytorch=1.12.1 From 8db16c7ba3f1cbf5345ea1361ad0aee605c57caa Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Tue, 21 Mar 2023 15:11:17 +0800 Subject: [PATCH 5/9]
Update README.md --- examples/images/dreambooth/README.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/examples/images/dreambooth/README.md b/examples/images/dreambooth/README.md index 14ed66c8d45b..f489b2267aed 100644 --- a/examples/images/dreambooth/README.md +++ b/examples/images/dreambooth/README.md @@ -5,12 +5,12 @@ The `train_dreambooth_colossalai.py` script shows how to implement the training By accommodating model data in CPU and GPU and moving the data to the computing device when necessary, [Gemini](https://www.colossalai.org/docs/advanced_tutorials/meet_gemini), the Heterogeneous Memory Manager of [Colossal-AI](https://github.com/hpcaitech/ColossalAI) can break through the GPU memory wall by using GPU and CPU memory (composed of CPU DRAM or NVMe SSD memory) together at the same time. Moreover, the model scale can be further improved by combining heterogeneous training with other parallel approaches, such as data parallel, tensor parallel and pipeline parallel. -## Installing the dependencies +## Installation -Before running the scripts, make sure to install the library's training dependencies: +To begin with, make sure your operating system has a CUDA version suitable for this training, i.e. CUDA 11.6-11.8. Also make sure the module versions are compatible across the whole environment. 
Before running the scripts, make sure to install the library's training dependencies: ```bash -pip install -r requirements_colossalai.txt +pip install -r requirements.txt ``` ### Install [colossalai](https://github.com/hpcaitech/ColossalAI.git) @@ -37,9 +37,7 @@ The `text` include the tag `Teyvat`, `Name`,`Element`, `Weapon`, `Region`, `Mode ## Training -The arguement `placement` can be `cpu`, `auto`, `cuda`, with `cpu` the GPU RAM required can be minimized to 4GB but will deceleration, with `cuda` you can also reduce GPU memory by half but accelerated training, with `auto` a more balanced solution for speed and memory can be obtained。 - -**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___** +We provide the script `colossalai.sh` to run the training task with Colossal-AI. We also provide the traditional DreamBooth training script, `dreambooth.sh`, for comparison. For instance, the training script for the [stable-diffusion-v1-4] model can be modified as follows: ```bash export MODEL_NAME="CompVis/stable-diffusion-v1-4" export INSTANCE_DIR="path-to-instance-images" export OUTPUT_DIR="path-to-save-model" torchrun --nproc_per_node 2 train_dreambooth_colossalai.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --output_dir=$OUTPUT_DIR \ --instance_prompt="a photo of a dog" \ --resolution=512 \ --train_batch_size=1 \ --learning_rate=5e-6 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --num_class_images=200 \ --max_train_steps=400 \ --placement="cuda" ``` - +- `Model_NAME` refers to the model you are training. +- `INSTANCE_DIR` refers to the path to your instance images; you need to fill this in with your own path. +- `OUTPUT_DIR` refers to the local path where the trained model is saved; make sure the path has enough space. +- `resolution` refers to the corresponding resolution number of your target model. Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model. +- `placement` refers to the training strategy supported by Colossal-AI, default = 'cuda', which loads all the parameters into CUDA memory. 
On the other hand, 'cpu' uses the CPU offload strategy, while 'auto' enables Gemini; both are featured by Colossal-AI. ### Training with prior-preservation loss Prior-preservation is used to avoid overfitting and language-drift. Refer to the paper to learn more about it. For prior-preservation we first generate images using the model with a class prompt and then use those during training along with our data. -According to the paper, it's recommended to generate `num_epochs * num_samples` images for prior-preservation. 200-300 works well for most cases. The `num_class_images` flag sets the number of images to generate with the class prompt. You can place existing images in `class_data_dir`, and the training script will generate any additional images so that `num_class_images` are present in `class_data_dir` during training time. + +According to the paper, it's recommended to generate `num_epochs * num_samples` images for prior-preservation. 200-300 works well for most cases. The `num_class_images` flag sets the number of images to generate with the class prompt. You can place existing images in `class_data_dir`, and the training script will generate any additional images so that `num_class_images` are present in `class_data_dir` during training time. The general script can then be modified as follows. ```bash export MODEL_NAME="CompVis/stable-diffusion-v1-4" @@ -91,7 +94,7 @@ torchrun --nproc_per_node 2 train_dreambooth_colossalai.py \ ## Inference -Once you have trained a model using above command, the inference can be done simply using the `StableDiffusionPipeline`. Make sure to include the `identifier`(e.g. sks in above example) in your prompt. +Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline`. Make sure to include the identifier (e.g. `sks`, from `--instance_prompt="a photo of sks dog"` in the above example) in your prompt. 
```python from diffusers import StableDiffusionPipeline From 86bc6fa242a8f893c4ad639c45a708ae1b1597c8 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Tue, 21 Mar 2023 15:11:53 +0800 Subject: [PATCH 6/9] Update README.md --- examples/images/dreambooth/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/images/dreambooth/README.md b/examples/images/dreambooth/README.md index f489b2267aed..b067a437c764 100644 --- a/examples/images/dreambooth/README.md +++ b/examples/images/dreambooth/README.md @@ -57,7 +57,7 @@ torchrun --nproc_per_node 2 train_dreambooth_colossalai.py \ --max_train_steps=400 \ --placement="cuda" ``` -- `Model_NAME` refers to the model you are training. +- `MODEL_NAME` refers to the model you are training. - `INSTANCE_DIR` refers to the path to your instance images; you need to fill this in with your own path. - `OUTPUT_DIR` refers to the local path where the trained model is saved; make sure the path has enough space. - `resolution` refers to the corresponding resolution number of your target model. Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model. 
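The `--placement` choices documented in the bullet list above can be summarized as a small lookup table. This is a purely illustrative sketch: the helper and its names below are hypothetical and not part of `train_dreambooth_colossalai.py`; the descriptions paraphrase the README bullets.

```python
# Hypothetical helper summarizing the `--placement` flag described above.
# Not part of train_dreambooth_colossalai.py; for illustration only.
PLACEMENT_STRATEGIES = {
    "cuda": "load all parameters into CUDA memory (fastest, highest GPU RAM usage)",
    "cpu": "CPU offload: keep model data in CPU memory, move it to the GPU when needed",
    "auto": "Gemini: Colossal-AI's heterogeneous memory manager balances speed and memory",
}

def describe_placement(placement: str) -> str:
    """Return a one-line description of a `--placement` value."""
    if placement not in PLACEMENT_STRATEGIES:
        raise ValueError(
            f"unknown placement {placement!r}; expected one of {sorted(PLACEMENT_STRATEGIES)}"
        )
    return PLACEMENT_STRATEGIES[placement]
```

For example, `describe_placement("auto")` returns the Gemini description, while an unknown value raises `ValueError`.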
From 19f6d61f24e1e2a607bac8c5fd681b3e9bac1524 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Tue, 21 Mar 2023 15:13:00 +0800 Subject: [PATCH 7/9] Delete requirements_colossalai.txt --- examples/images/dreambooth/requirements_colossalai.txt | 8 -------- 1 file changed, 8 deletions(-) delete mode 100644 examples/images/dreambooth/requirements_colossalai.txt diff --git a/examples/images/dreambooth/requirements_colossalai.txt b/examples/images/dreambooth/requirements_colossalai.txt deleted file mode 100644 index c4a0e91703bb..000000000000 --- a/examples/images/dreambooth/requirements_colossalai.txt +++ /dev/null @@ -1,8 +0,0 @@ -diffusers -torch -torchvision -ftfy -tensorboard -modelcards -transformers -colossalai==0.2.0+torch1.12cu11.3 -f https://release.colossalai.org From 8fc369120a6919e479b5b48a2385a35e7b1b9afc Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Tue, 21 Mar 2023 15:13:33 +0800 Subject: [PATCH 8/9] Update requirements.txt --- examples/images/dreambooth/requirements.txt | 1 - 1 file changed, 1 deletion(-) diff --git a/examples/images/dreambooth/requirements.txt b/examples/images/dreambooth/requirements.txt index 6c4f40fb5dd0..1ec828c630ef 100644 --- a/examples/images/dreambooth/requirements.txt +++ b/examples/images/dreambooth/requirements.txt @@ -5,4 +5,3 @@ transformers>=4.21.0 ftfy tensorboard modelcards -colossalai From 0a6d6f7cb9cca94879d4203611672c648a17ede1 Mon Sep 17 00:00:00 2001 From: NatalieC323 <127177614+NatalieC323@users.noreply.github.com> Date: Tue, 21 Mar 2023 15:14:21 +0800 Subject: [PATCH 9/9] Update README.md --- examples/images/diffusion/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/images/diffusion/README.md b/examples/images/diffusion/README.md index 22970ced064e..99cbd39ef849 100644 --- a/examples/images/diffusion/README.md +++ b/examples/images/diffusion/README.md @@ -78,7 +78,7 @@ You can 
install the latest version (0.2.7) from our official website or from sou ##### Download the suggested version for this training ``` -pip install colossalai=0.2.5 +pip install colossalai==0.2.5 ``` ##### Download the latest version from pip for the latest torch version
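For context on the final patch: a single `=` is not a valid pip requirement operator (pip rejects `colossalai=0.2.5` as an invalid requirement), while `==` pins the exact release. The distinction can be sketched with a simplified, hypothetical validator; this regex is illustrative only, not pip's real PEP 508 parser.

```python
import re

# Simplified, illustrative pattern for an exact `name==version` pin.
# Real pip requirement parsing (PEP 508) is considerably richer than this.
_EXACT_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[A-Za-z0-9][A-Za-z0-9.+_!-]*$")

def is_exact_pin(requirement: str) -> bool:
    """Return True if `requirement` looks like an exact `name==version` pin."""
    return _EXACT_PIN.match(requirement) is not None
```

Under this check, `colossalai==0.2.5` is an exact pin, while `colossalai=0.2.5` (the typo the patch fixes) is not.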