🐛 Describe the bug

Description

Some combinations of arguments lead to errors in `train_prompts.py`.

Details

- Error of `train_prompts.py`

These errors can be reproduced by modifying `test_ci.sh` in `ColossalAI/applications/Chat/examples`.
The combinations are:

- `gpt2` with `ddp` (earlier reported by [BUG]: bug in training rm with ddp strategy with single machine multi-GPUs! #3421): `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.`
- `llama` with `naive`, `ddp`, `colossalai_gemini`, or `colossalai_zero2`: `Repository Not Found for url: https://huggingface.co/{...}/resolve/main/tokenizer.model.`
- `roberta` with `naive`, `ddp`, `colossalai_gemini`, or `colossalai_zero2`: `CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`
- Error of modified `train_prompts.py`

These errors can be reproduced through the following script (`reproduce_error.py`), launched by the shell snippet after it.
`reproduce_error.py`:

```python
import argparse

from coati.models.bloom import BLOOMActor
from coati.models.gpt import GPTActor
from coati.models.llama import LlamaActor
from coati.models.opt import OPTActor
from coati.models.roberta import RoBERTaActor
from coati.trainer.strategies import ColossalAIStrategy
from colossalai.nn.optimizer import HybridAdam


def main(args):
    initializer_dict = {
        'gpt': lambda: GPTActor(),
        'bloom': lambda: BLOOMActor(),
        'opt': lambda: OPTActor(),
        'llama': lambda: LlamaActor(),
        'roberta': lambda: RoBERTaActor(),
    }
    initializer = initializer_dict[args.model]

    strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', initial_scale=2**5)
    with strategy.model_init_context():
        # configure model
        actor = initializer()

    # configure optimizer
    actor_optim = HybridAdam(actor.parameters(), lr=1e-7)

    (actor, actor_optim) = strategy.prepare((actor, actor_optim))

    try:
        # FIXME: this causes the error
        actor.to("cpu")
        print(f"[SUCCESS]: {strategy.unwrap_model(actor).__class__.__name__}")
    except RuntimeError as e:
        print(f"[ERROR]: {strategy.unwrap_model(actor).__class__.__name__}")
        # raise e


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='gpt',
                        choices=['gpt', 'bloom', 'opt', 'llama', 'roberta'])
    args = parser.parse_args()
    main(args)
```
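For contrast, here is a minimal plain-PyTorch sketch (my own illustration, not part of the reproduce script) showing that `.to("cpu")` itself is well-defined for an unwrapped module; the error above appears only after `strategy.prepare` has wrapped the actor (ZeRO stage 3 shards the parameters), which suggests the wrapper's handling of `.to()` is at fault rather than the call itself.

```python
import torch.nn as nn

# Moving an unwrapped nn.Module to CPU succeeds without error; the
# reproduce script fails only on the strategy-wrapped actor.
model = nn.Linear(4, 4)
model.to("cpu")
print(all(p.device.type == "cpu" for p in model.parameters()))  # True
```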
Launched with:

```shell
set -xe

set_n_least_used_CUDA_VISIBLE_DEVICES() {
    local n=${1:-"9999"}
    echo "GPU Memory Usage:"
    local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
        tail -n +2 |
        nl -v 0 |
        tee /dev/tty |
        sort -g -k 2 |
        awk '{print $1}' |
        head -n $n)
    export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
    echo "Now CUDA_VISIBLE_DEVICES is set to:"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
}

set_n_least_used_CUDA_VISIBLE_DEVICES 4
export CUDA_LAUNCH_BLOCKING=1

for model in 'gpt' 'bloom' 'opt' 'llama' 'roberta'; do
    torchrun --standalone --nproc_per_node=4 reproduce_error.py --model $model
done
```
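The selection pipeline in `set_n_least_used_CUDA_VISIBLE_DEVICES` (number the per-GPU memory readings, sort numerically, keep the first n indices) can be sketched in plain Python for clarity; the function name and the sample memory figures below are my own illustration:

```python
def pick_least_used_gpus(mem_used_mib, n):
    # nl -v 0: pair each GPU index with its memory usage
    indexed = list(enumerate(mem_used_mib))
    # sort -g -k 2: order by memory used, ascending
    indexed.sort(key=lambda pair: pair[1])
    # head -n $n | awk '{print $1}': keep the first n indices
    return [i for i, _ in indexed[:n]]

# Hypothetical nvidia-smi readings for six GPUs (MiB of memory in use)
mem = [11000, 300, 250, 9000, 100, 500]
ids = pick_least_used_gpus(mem, 4)
print(",".join(str(i) for i in ids))  # -> 4,2,1,5 (CUDA_VISIBLE_DEVICES format)
```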
The combinations are:

- `gpt2`, `opt`, `llama`, or `roberta` with `colossalai_gemini`: `RuntimeError: CUDA error: invalid argument`

Environment

- PyTorch: 1.13.1
- Colossal-AI: commit b3ab7fbabf
- Transformers: commit 61f79b2986