🐛 Describe the bug
(/aml2/colo) root@A200:/aml2/ColossalAI/examples/language/llama2# vim /aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py
(/aml2/colo) root@A200:/aml2/ColossalAI/examples/language/llama2# export NCCL_IB_DISABLE=1; export NCCL_SOCKET_IFNAME=eth0;NCCL_DEBUG=INFO;TORCH_CPP_LOG_LEVEL=DEBUG; export NCCL_DEBUG_SUBSYS=ALL;export TORCH_DISTRIBUTED_DEBUG=INFO; colossalai run --nproc_per_node 2 --hostfile hostfile pretrain.py --d "/aml/data/boyang/RedPajama-Data-1T-Sample" -p "hybrid_parallel" -c "7b" -b 2 -l 2048
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=A100 --master_port=29500 pretrain.py --d /aml/data/boyang/RedPajama-Data-1T-Sample -p hybrid_parallel -c 7b -b 2 -l 2048 on A100, is localhost: False, exception: No authentication methods available
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/aml2/colo/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
/aml2/colo/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
Epoch 0: 0%| | 0/116314 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 331, in <module>
main()
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 289, in main
outputs = booster.execute_pipeline(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/booster.py", line 205, in execute_pipeline
return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 832, in execute_pipeline
outputs = self.schedule.forward_backward_step(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 288, in forward_backward_step
output_obj = self.forward_step(model, input_obj, criterion, accum_loss, outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 164, in forward_step
output_obj = model_forward(model, micro_batch, input_obj)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py", line 120, in model_forward
return model(**data, **internal_inputs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 118, in forward
return super().forward(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/interface/model.py", line 25, in forward
return self.module(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 233, in llama_for_causal_lm_forward
outputs = LlamaPipelineForwards.llama_model_forward(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 100, in llama_model_forward
attention_mask = self._prepare_decoder_attention_mask(
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LlamaModel' object has no attribute '_prepare_decoder_attention_mask'
Epoch 0: 0%| | 0/116314 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 331, in <module>
main()
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 289, in main
outputs = booster.execute_pipeline(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/booster.py", line 205, in execute_pipeline
return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 832, in execute_pipeline
outputs = self.schedule.forward_backward_step(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 288, in forward_backward_step
output_obj = self.forward_step(model, input_obj, criterion, accum_loss, outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 164, in forward_step
output_obj = model_forward(model, micro_batch, input_obj)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py", line 120, in model_forward
return model(**data, **internal_inputs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 118, in forward
return super().forward(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/interface/model.py", line 25, in forward
return self.module(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 233, in llama_for_causal_lm_forward
outputs = LlamaPipelineForwards.llama_model_forward(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 100, in llama_model_forward
attention_mask = self._prepare_decoder_attention_mask(
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LlamaModel' object has no attribute '_prepare_decoder_attention_mask'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4959) of binary: /aml2/colo/bin/python
Traceback (most recent call last):
File "/aml2/colo/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
pretrain.py FAILED
Failures:
[1]:
time : 2023-11-12_05:22:03
host : A200
rank : 3 (local_rank: 1)
exitcode : 1 (pid: 4960)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-11-12_05:22:03
host : A200
rank : 2 (local_rank: 0)
exitcode : 1 (pid: 4959)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=A100 --master_port=29500 pretrain.py --d /aml/data/boyang/RedPajama-Data-1T-Sample -p hybrid_parallel -c 7b -b 2 -l 2048 on A200, is localhost: True, exception: Encountered a bad command exit code!
Environment
Ubuntu 18.04
2 nodes × 2 A100 80GB GPUs

torch 1.13.1
CUDA 11.7
ColossalAI: latest
BTW: TP and Gemini work fine, but PP does not work as designed.
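For context, the AttributeError is raised by torch.nn.Module's `__getattr__` fallback (module.py line 1269 in the traceback), which fires only when a name is not found as a regular attribute, parameter, buffer, or submodule. That suggests the installed transformers release no longer defines `LlamaModel._prepare_decoder_attention_mask`, while ColossalAI's shardformer pipeline forward still calls it. A minimal sketch of the failure mode, using a hypothetical `MiniModule` in place of `nn.Module` (plain Python, no torch required):

```python
# MiniModule is a hypothetical stand-in that mimics torch.nn.Module's
# __getattr__ fallback: when normal attribute lookup fails, Python calls
# __getattr__, which raises AttributeError with this exact message format.
class MiniModule:
    def __getattr__(self, name):
        # Mirrors the raise in torch/nn/modules/module.py
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )

m = MiniModule()
try:
    # The shardformer forward does the equivalent of this lookup on LlamaModel;
    # if the method was removed from the class, lookup falls through here.
    m._prepare_decoder_attention_mask
except AttributeError as e:
    print(e)
```

If this diagnosis is right, pinning transformers to the version ColossalAI's llama2 example was tested against (rather than the newest release) would be the first thing to try.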