🐛 Describe the bug
I am following this blog https://medium.com/pytorch/colossalchat-an-open-source-solution-for-cloning-chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b to train a 6.7B-parameter model. The blog outlines the training scheme for a 7B model, so I expect mine to be more or less the same.
When I use the colossalai_zero2_cpu strategy, the execution stops with the following traceback:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 644 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 645 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 646 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 647 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 649 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 650 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 651 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 4 (pid: 648) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
train_prompts.py FAILED
----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-08_01:29:36
host : dgx-server01
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 648)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 648
====================================================
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /rlhf/applications/Chat && export NV_LIBCUBLAS_VERSION="11.5.1.109-1" NVIDIA_VISIBLE_DEVICES="all" NV_NVML_DEV_VERSION="11.3.58-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.9.9-1+cuda11.3" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.9.9-1" HOSTNAME="dgx-server01" NVIDIA_REQUIRE_CUDA="cuda>=11.3 brand=tesla,driver>=418,driver<419 driver>=450" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-3=11.5.1.109-1" NV_NVTX_VERSION="11.3.109-1" NV_CUDA_CUDART_DEV_VERSION="11.3.109-1" NV_LIBCUSPARSE_VERSION="11.6.0.109-1" NV_LIBNPP_VERSION="11.3.3.95-1" NCCL_VERSION="2.9.9-1" PWD="/rlhf/applications/Chat" NV_CUDNN_PACKAGE="libcudnn8=8.2.0.53-1+cuda11.3" NVIDIA_DRIVER_CAPABILITIES="compute,utility" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-11-3=11.3.111-1" NV_LIBNPP_PACKAGE="libnpp-11-3=11.3.3.95-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" NV_LIBCUBLAS_DEV_VERSION="11.5.1.109-1" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-3" NV_CUDA_CUDART_VERSION="11.3.109-1" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35
:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" CUDA_VERSION="11.3.1" NV_LIBCUBLAS_PACKAGE="libcublas-11-3=11.5.1.109-1" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-3=11.3.3.95-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-3" NV_LIBNPP_DEV_VERSION="11.3.3.95-1" LESSCLOSE="/usr/bin/lesspipe %s %s" TERM="xterm" NV_LIBCUSPARSE_DEV_VERSION="11.6.0.109-1" LESSOPEN="| /usr/bin/lesspipe %s" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.2.0.53" SHLVL="1" NV_CUDA_LIB_VERSION="11.3.1-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.2.0.53-1+cuda11.3" NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-3" NV_LIBNCCL_PACKAGE="libnccl2=2.9.9-1+cuda11.3" LD_LIBRARY_PATH="/root/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64" NV_NVPROF_VERSION="11.3.111-1" CUDA_HOME="/usr/local/cuda" PATH="/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.9.9-1" OLDPWD="/rlhf" _="/opt/conda/bin/colossalai" LC_CTYPE="C.UTF-8" && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
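Exit code -9 means the worker received SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host RAM (the cpu placement policy keeps parameters and optimizer states in CPU memory). A rough back-of-envelope estimate of the CPU footprint, using byte counts I am assuming from the usual mixed-precision Adam accounting (fp16 weights + fp16 grads + fp32 master weights/momentum/variance = 16 bytes per trained parameter):

```python
# Back-of-envelope host-RAM estimate for ZeRO with CPU offload.
# Assumed: 16 bytes/param for a trained model (2 fp16 weights + 2 fp16 grads
# + 4 fp32 master + 4 fp32 momentum + 4 fp32 variance), and 2 bytes/param
# for a frozen inference-only model (fp16 weights).
GiB = 1024 ** 3

def trained_model_bytes(n_params: int) -> int:
    return 16 * n_params

def frozen_model_bytes(n_params: int) -> int:
    return 2 * n_params

actor = trained_model_bytes(int(6.7e9))    # trained actor
critic = trained_model_bytes(int(6.7e9))   # trained critic
initial = frozen_model_bytes(int(6.7e9))   # frozen reference model
reward = frozen_model_bytes(int(6.7e9))    # frozen reward model

total_gib = (actor + critic + initial + reward) / GiB
print(f"approx. host RAM needed: {total_gib:.0f} GiB")
```

If this estimate is in the right ballpark, two trained 6.7B models plus two frozen ones with optimizer states offloaded to CPU exceed 200 GiB of host RAM, which would explain the SIGKILL.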
Believing it to be a case of insufficient memory, I thought the model perhaps needs to be sharded.
With the colossalai_gemini strategy, the job crashed with the same errors as before.
Next I tried a mix and match of the last two strategies:
elif args.strategy == 'colossalai_zero1_cpu':
strategy = ColossalAIStrategy(stage=3, placement_policy='cpu', shard_init=True)
For the sake of testing, I reduced the reward model to 2.7B parameters while keeping the actor at 6.7B. That led to the following errors/warnings:
/rlhf/applications/Chat/coati/trainer/strategies/colossalai.py:91: UserWarning: Shard init is not supported model.from_pretrained() yet. Please load weights after strategy.prepare()
size mismatch *** *****: copying a param with shape torch.Size([2560]) from checkpoint, the shape in current model is torch.Size([320])
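The reported shapes are consistent with each of the 8 ranks holding a 1/8 slice of a parameter: 2560 / 8 = 320. A standalone illustration in plain Python (my own sketch, no torch, just mimicking what chunking a 1-D tensor along dim=-1 does):

```python
# Mimic splitting a parameter of shape (2560,) across 8 ranks along the last
# dimension, the way torch.Tensor.chunk does for evenly divisible sizes.
full_param = list(range(2560))   # stand-in for a 1-D weight of size 2560
world_size = 8

def chunk(seq, n):
    """Split seq into n equal contiguous pieces (matches torch.chunk when
    len(seq) is divisible by n)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]

shards = chunk(full_param, world_size)
print(len(shards[0]))   # each rank's slice has length 320 = 2560 // 8
```

So the "current model" built on each rank expects the sharded shape [320], while the checkpoint supplies the full [2560] tensor.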
Following the advice here, I implemented the following
class GPTActor(Actor):
    """
    GPT Actor model.

    Args:
        pretrained (str): Pretrained model name or path.
        config (GPT2Config): Model config.
        checkpoint (bool): Enable gradient checkpointing.
        lora_rank (int): Rank of the LoRA layer.
        lora_train_bias (str): Bias training strategy for the LoRA layer.
    """

    def __init__(self,
                 pretrained: Optional[str] = None,
                 config: Optional[GPT2Config] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none',
                 state_dict: dict = None) -> None:
        if pretrained is not None:
            if state_dict is not None:
                model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(pretrained))
                for n, p in model.named_parameters():
                    x = state_dict[n]
                    x = x.chunk(torch.cuda.device_count(), dim=-1)
                    x = x[dist.get_rank()]
                    p.data.copy_(x)
            else:
                model = AutoModelForCausalLM.from_pretrained(pretrained)
Something to notice: even with shard_init=True, the log reports
INFO colossalai - colossalai - INFO: Distributed
environment is initialized, data parallel size: 8,
pipeline parallel size: 1, tensor parallel size: 1
I would have expected the pipeline parallel size or tensor parallel size to be different from 1.
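If I understand the ColossalAI docs correctly, the pipeline/tensor parallel sizes come from the launch config rather than from shard_init; something like the sketch below (the exact keys are my assumption from the docs, not verified against this version):

```python
# Hypothetical ColossalAI-style parallel config (keys assumed from the docs;
# verify against your installed version). With 8 GPUs, data parallel size
# becomes world_size / (pipeline * tensor) = 2 here.
CONFIG = dict(
    parallel=dict(
        pipeline=1,
        tensor=dict(size=4, mode='1d'),  # 1-D tensor parallelism across 4 GPUs
    )
)

world_size = 8
data_parallel = world_size // (CONFIG['parallel']['pipeline']
                               * CONFIG['parallel']['tensor']['size'])
print(data_parallel)  # 2
```

With no such config, all 8 GPUs default to pure data parallelism, which matches the "data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1" line above.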
After implementing the above hacky sharding solution, the models were initialized alright, but the job crashed with the following traceback:
Every failing rank raises the same error (their outputs were interleaved):
Traceback (most recent call last):
  File "/rlhf/applications/Chat/train_prompts.py", line 274, in <module>
    main(args)
  File "/rlhf/applications/Chat/train_prompts.py", line 201, in main
    (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
  File "/rlhf/applications/Chat/coati/trainer/strategies/base.py", line 84, in prepare
    optimizer = self.setup_optimizer(optimizer, self._unwrap_model(model))
  File "/rlhf/applications/Chat/coati/trainer/strategies/colossalai.py", line 147, in setup_optimizer
    return zero_optim_wrapper(model, optimizer, optim_config=self.zero_optim_config, **self.optim_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/wrapper.py", line 88, in zero_optim_wrapper
    assert hasattr(model, "_colo_zero_stage"), "You should use `zero_ddp_wrapper` first"
AssertionError: You should use `zero_ddp_wrapper` first
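The assertion says zero_optim_wrapper expects the model to have been wrapped first by whatever wrapper tags it with _colo_zero_stage. A toy reproduction of that ordering check with stand-in functions — these names are my illustration, not the real ColossalAI API:

```python
# Stand-ins mimicking the wrap-model-before-wrap-optimizer contract that the
# assertion enforces; function names are illustrative, not ColossalAI's API.
class Model:
    pass

def fake_zero_model_wrapper(model, zero_stage=3):
    model._colo_zero_stage = zero_stage   # the tag the assertion looks for
    return model

def fake_zero_optim_wrapper(model, optimizer):
    assert hasattr(model, "_colo_zero_stage"), "You should use `zero_ddp_wrapper` first"
    return optimizer

m, opt = Model(), "opt"
try:
    fake_zero_optim_wrapper(m, opt)       # wrong order -> same AssertionError
except AssertionError as e:
    print(e)

m = fake_zero_model_wrapper(m)
fake_zero_optim_wrapper(m, opt)           # correct order passes
```

In other words, the crash suggests the strategy's prepare() reached setup_optimizer with a model that was never ZeRO-wrapped, which may be a side effect of my custom weight-loading path.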
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3610) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_prompts.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3611)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3612)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3613)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3614)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3615)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3616)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3617)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-08_02:41:09
host : dgx-server01
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3610)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /rlhf/applications/Chat && export NV_LIBCUBLAS_VERSION="11.5.1.109-1" NVIDIA_VISIBLE_DEVICES="all" NV_NVML_DEV_VERSION="11.3.58-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.9.9-1+cuda11.3" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.9.9-1" HOSTNAME="dgx-server01" NVIDIA_REQUIRE_CUDA="cuda>=11.3 brand=tesla,driver>=418,driver<419 driver>=450" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-3=11.5.1.109-1" NV_NVTX_VERSION="11.3.109-1" NV_CUDA_CUDART_DEV_VERSION="11.3.109-1" NV_LIBCUSPARSE_VERSION="11.6.0.109-1" NV_LIBNPP_VERSION="11.3.3.95-1" NCCL_VERSION="2.9.9-1" PWD="/rlhf/applications/Chat" NV_CUDNN_PACKAGE="libcudnn8=8.2.0.53-1+cuda11.3" NVIDIA_DRIVER_CAPABILITIES="compute,utility" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-11-3=11.3.111-1" NV_LIBNPP_PACKAGE="libnpp-11-3=11.3.3.95-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" NV_LIBCUBLAS_DEV_VERSION="11.5.1.109-1" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-3" NV_CUDA_CUDART_VERSION="11.3.109-1" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35
:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" CUDA_VERSION="11.3.1" NV_LIBCUBLAS_PACKAGE="libcublas-11-3=11.5.1.109-1" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-3=11.3.3.95-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-3" NV_LIBNPP_DEV_VERSION="11.3.3.95-1" LESSCLOSE="/usr/bin/lesspipe %s %s" TERM="xterm" NV_LIBCUSPARSE_DEV_VERSION="11.6.0.109-1" LESSOPEN="| /usr/bin/lesspipe %s" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.2.0.53" SHLVL="1" NV_CUDA_LIB_VERSION="11.3.1-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.2.0.53-1+cuda11.3" NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-3" NV_LIBNCCL_PACKAGE="libnccl2=2.9.9-1+cuda11.3" LD_LIBRARY_PATH="/root/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64" NV_NVPROF_VERSION="11.3.111-1" CUDA_HOME="/usr/local/cuda" PATH="/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.9.9-1" OLDPWD="/rlhf" _="/opt/conda/bin/colossalai" LC_CTYPE="C.UTF-8" && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
With both models at 6.7B parameters, the job was killed (SIGKILL) again:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6012 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6015 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6016 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6017 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6018 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 6019) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_prompts.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-08_03:03:53
host : dgx-server01
rank : 7 (local_rank: 7)
exitcode : -9 (pid: 6019)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 6019
=====================================================
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Command: 'cd /rlhf/applications/Chat && export NV_LIBCUBLAS_VERSION="11.5.1.109-1" NVIDIA_VISIBLE_DEVICES="all" NV_NVML_DEV_VERSION="11.3.58-1" NV_CUDNN_PACKAGE_NAME="libcudnn8" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.9.9-1+cuda11.3" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.9.9-1" HOSTNAME="dgx-server01" NVIDIA_REQUIRE_CUDA="cuda>=11.3 brand=tesla,driver>=418,driver<419 driver>=450" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-3=11.5.1.109-1" NV_NVTX_VERSION="11.3.109-1" NV_CUDA_CUDART_DEV_VERSION="11.3.109-1" NV_LIBCUSPARSE_VERSION="11.6.0.109-1" NV_LIBNPP_VERSION="11.3.3.95-1" NCCL_VERSION="2.9.9-1" PWD="/rlhf/applications/Chat" NV_CUDNN_PACKAGE="libcudnn8=8.2.0.53-1+cuda11.3" NVIDIA_DRIVER_CAPABILITIES="compute,utility" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-11-3=11.3.111-1" NV_LIBNPP_PACKAGE="libnpp-11-3=11.3.3.95-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" NV_LIBCUBLAS_DEV_VERSION="11.5.1.109-1" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-3" NV_CUDA_CUDART_VERSION="11.3.109-1" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35
:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" CUDA_VERSION="11.3.1" NV_LIBCUBLAS_PACKAGE="libcublas-11-3=11.5.1.109-1" NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-3=11.3.3.95-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-3" NV_LIBNPP_DEV_VERSION="11.3.3.95-1" LESSCLOSE="/usr/bin/lesspipe %s %s" TERM="xterm" NV_LIBCUSPARSE_DEV_VERSION="11.6.0.109-1" LESSOPEN="| /usr/bin/lesspipe %s" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="8.2.0.53" SHLVL="1" NV_CUDA_LIB_VERSION="11.3.1-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.2.0.53-1+cuda11.3" NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-3" NV_LIBNCCL_PACKAGE="libnccl2=2.9.9-1+cuda11.3" LD_LIBRARY_PATH="/root/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64" NV_NVPROF_VERSION="11.3.111-1" CUDA_HOME="/usr/local/cuda" PATH="/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.9.9-1" OLDPWD="/rlhf" _="/opt/conda/bin/colossalai" LC_CTYPE="C.UTF-8" && torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_prompts.py'
Exit code: 1
Stdout: already printed
Stderr: already printed
====== Training on All Nodes =====
127.0.0.1: failure
====== Stopping All Nodes =====
127.0.0.1: finish
Any help will be appreciated.
Environment
8× Tesla V100 (32 GB each)
shm: 512
CUDA Version 11.6
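For context, a rough check of what fits on one 32 GB V100 (my own arithmetic, assuming fp16 weights at 2 bytes per parameter, nothing else):

```python
# fp16 weights only -- no gradients, optimizer states, or activations.
GiB = 1024 ** 3
BYTES_PER_PARAM_FP16 = 2

def weights_gib(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM_FP16 / GiB

print(round(weights_gib(6.7e9), 1))   # one 6.7B model: ~12.5 GiB of weights
```

So a single replica of the weights already takes ~12.5 GiB; with actor, critic, reward, and initial models plus gradients, optimizer states, and activations, 32 GB per GPU is exceeded many times over without sharding or offload, which is consistent with the OOM behavior above.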