
[chat]: bugs of Coati's train_prompts.py #4023

@cwher

🐛 Describe the bug

Description

Some combinations of command-line arguments cause train_prompts.py to fail.

Details

  • Error of train_prompts.py

    These errors can be reproduced by modifying test_ci.sh in ColossalAI/applications/Chat/examples.

    The combinations are,

  • Error of modified train_prompts.py

    These errors can be reproduced with the following script, reproduce_error.py:

    import argparse
    
    from coati.models.bloom import BLOOMActor
    from coati.models.gpt import GPTActor
    from coati.models.llama import LlamaActor
    from coati.models.opt import OPTActor
    from coati.models.roberta import RoBERTaActor
    from coati.trainer.strategies import ColossalAIStrategy
    
    from colossalai.nn.optimizer import HybridAdam
    
    
    def main(args):
        initializer_dict = {
            'gpt': lambda: GPTActor(),
            'bloom': lambda: BLOOMActor(),
            'opt': lambda: OPTActor(),
            'llama': lambda: LlamaActor(),
            'roberta': lambda: RoBERTaActor(),
        }
        initializer = initializer_dict[args.model]
        strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', initial_scale=2**5)
    
        with strategy.model_init_context():
            # configure model
            actor = initializer()
    
        # configure optimizer
        actor_optim = HybridAdam(actor.parameters(), lr=1e-7)
    
        (actor, actor_optim) = strategy.prepare((actor, actor_optim))
    
        try:
            # FIXME: this causes the error
            actor.to("cpu")
            print(f"[SUCCESS]: {strategy.unwrap_model(actor).__class__.__name__}")
        except RuntimeError as e:
            print(f"[ERROR]: {strategy.unwrap_model(actor).__class__.__name__}")
            # raise e
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--model', type=str, default='gpt',
                            choices=['gpt', 'bloom', 'opt', 'llama', 'roberta'])
        args = parser.parse_args()
        main(args)
    The script is launched with:

      set -xe
    
      set_n_least_used_CUDA_VISIBLE_DEVICES() {
          local n=${1:-"9999"}
          echo "GPU Memory Usage:"
          local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
              tail -n +2 |
              nl -v 0 |
              tee /dev/tty |
              sort -g -k 2 |
              awk '{print $1}' |
              head -n $n)
          export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
          echo "Now CUDA_VISIBLE_DEVICES is set to:"
          echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
      }
    
      set_n_least_used_CUDA_VISIBLE_DEVICES 4
    
      export CUDA_LAUNCH_BLOCKING=1
      for model in 'gpt' 'bloom' 'opt' 'llama' 'roberta'; do
          torchrun --standalone --nproc_per_node=4 reproduce_error.py --model $model
      done

    The failing combinations are:

    • gpt2-colossalai_gemini
    • opt-colossalai_gemini
    • llama-colossalai_gemini
    • roberta-colossalai_gemini

    each failing with:

      RuntimeError: CUDA error: invalid argument
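    A possible workaround (a hedged sketch in plain PyTorch, not Coati's or Colossal-AI's API, with `cpu_copy_via_state_dict` being a hypothetical helper) is to avoid calling `.to("cpu")` on the wrapped actor, since a Gemini/ZeRO-3 module's parameters are sharded and cannot be moved in place. Instead, one can gather a CPU state dict and load it into a fresh, unsharded instance of the model:

    ```python
    import torch
    import torch.nn as nn

    def cpu_copy_via_state_dict(model: nn.Module, factory) -> nn.Module:
        # Copy parameters/buffers to CPU tensors instead of moving the
        # (potentially sharded) module in place with .to("cpu").
        cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
        cpu_model = factory()  # fresh, unsharded CPU instance of the same model
        cpu_model.load_state_dict(cpu_state)
        return cpu_model

    # Usage sketch, with a plain module standing in for the unwrapped actor:
    model = nn.Linear(4, 2)
    clone = cpu_copy_via_state_dict(model, lambda: nn.Linear(4, 2))
    assert all(p.device.type == "cpu" for p in clone.parameters())
    ```

    This sidesteps the in-place device move entirely; whether it fits the train_prompts.py save path depends on how the strategy exposes the gathered state dict.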

Environment

  • PyTorch: 1.13.1

  • Colossal-AI: commit b3ab7fbabf

  • Transformers: commit 61f79b2986
