🐛 Describe the bug

Description

Some combinations of arguments lead to errors in `train_prompts.py`.

Details

- Error of `train_prompts.py`

These errors can be reproduced by modifying `test_ci.sh` in `ColossalAI/applications/Chat/examples`.
The combinations are:

- `gpt2` with `ddp` (earlier reported by [BUG]: bug in training rm with ddp strategy with single machine multi-GPUs! #3421): `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.`
- `llama` with `naive`, `ddp`, `colossalai_gemini`, or `colossalai_zero2`: `Repository Not Found for url: https://huggingface.co/{...}/resolve/main/tokenizer.model.`
- `roberta` with `naive`, `ddp`, `colossalai_gemini`, or `colossalai_zero2`: `CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`
- Error of modified `train_prompts.py`

These errors can be reproduced through the following script (`reproduce_error.py`), launched by the shell snippet after it.
`reproduce_error.py`:

```python
import argparse

from coati.models.bloom import BLOOMActor
from coati.models.gpt import GPTActor
from coati.models.llama import LlamaActor
from coati.models.opt import OPTActor
from coati.models.roberta import RoBERTaActor
from coati.trainer.strategies import ColossalAIStrategy
from colossalai.nn.optimizer import HybridAdam


def main(args):
    initializer_dict = {
        'gpt': lambda: GPTActor(),
        'bloom': lambda: BLOOMActor(),
        'opt': lambda: OPTActor(),
        'llama': lambda: LlamaActor(),
        'roberta': lambda: RoBERTaActor(),
    }
    initializer = initializer_dict[args.model]

    strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', initial_scale=2**5)
    with strategy.model_init_context():
        # configure model
        actor = initializer()

    # configure optimizer
    actor_optim = HybridAdam(actor.parameters(), lr=1e-7)

    (actor, actor_optim) = strategy.prepare((actor, actor_optim))

    try:
        # FIXME: this causes the error
        actor.to("cpu")
        print(f"[SUCCESS]: {strategy.unwrap_model(actor).__class__.__name__}")
    except RuntimeError as e:
        print(f"[ERROR]: {strategy.unwrap_model(actor).__class__.__name__}")
        # raise e


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='gpt',
                        choices=['gpt', 'bloom', 'opt', 'llama', 'roberta'])
    args = parser.parse_args()
    main(args)
```
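For contrast, here is a minimal plain-PyTorch sketch (my own illustration, not part of the reproduce script) showing that `.to("cpu")` itself is well-defined for an unwrapped module; the error above appears only after `strategy.prepare` has wrapped the actor (ZeRO stage 3 shards the parameters), which suggests the wrapper's handling of `.to()` is at fault rather than the call itself.

```python
import torch.nn as nn

# Moving an unwrapped nn.Module to CPU succeeds without error; the
# reproduce script fails only on the strategy-wrapped actor.
model = nn.Linear(4, 4)
model.to("cpu")
print(all(p.device.type == "cpu" for p in model.parameters()))  # True
```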
Launched with:

```shell
set -xe

set_n_least_used_CUDA_VISIBLE_DEVICES() {
    local n=${1:-"9999"}
    echo "GPU Memory Usage:"
    local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
        tail -n +2 |
        nl -v 0 |
        tee /dev/tty |
        sort -g -k 2 |
        awk '{print $1}' |
        head -n $n)
    export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
    echo "Now CUDA_VISIBLE_DEVICES is set to:"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
}

set_n_least_used_CUDA_VISIBLE_DEVICES 4
export CUDA_LAUNCH_BLOCKING=1

for model in 'gpt' 'bloom' 'opt' 'llama' 'roberta'; do
    torchrun --standalone --nproc_per_node=4 reproduce_error.py --model $model
done
```
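The selection pipeline in `set_n_least_used_CUDA_VISIBLE_DEVICES` (number the per-GPU memory readings, sort numerically, keep the first n indices) can be sketched in plain Python for clarity; the function name and the sample memory figures below are my own illustration:

```python
def pick_least_used_gpus(mem_used_mib, n):
    # nl -v 0: pair each GPU index with its memory usage
    indexed = list(enumerate(mem_used_mib))
    # sort -g -k 2: order by memory used, ascending
    indexed.sort(key=lambda pair: pair[1])
    # head -n $n | awk '{print $1}': keep the first n indices
    return [i for i, _ in indexed[:n]]

# Hypothetical nvidia-smi readings for six GPUs (MiB of memory in use)
mem = [11000, 300, 250, 9000, 100, 500]
ids = pick_least_used_gpus(mem, 4)
print(",".join(str(i) for i in ids))  # -> 4,2,1,5 (CUDA_VISIBLE_DEVICES format)
```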
The combinations are:

- `gpt2`, `opt`, `llama`, or `roberta` with `colossalai_gemini`: `RuntimeError: CUDA error: invalid argument`

Environment

- PyTorch: 1.13.1
- Colossal-AI: commit b3ab7fbabf
- Transformers: commit 61f79b2986