🐛 Describe the bug
(/aml2/colo) root@A200:/aml2/ColossalAI/examples/language/llama2# vim /aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py
(/aml2/colo) root@A200:/aml2/ColossalAI/examples/language/llama2# export NCCL_IB_DISABLE=1; export NCCL_SOCKET_IFNAME=eth0;NCCL_DEBUG=INFO;TORCH_CPP_LOG_LEVEL=DEBUG; export NCCL_DEBUG_SUBSYS=ALL;export TORCH_DISTRIBUTED_DEBUG=INFO; colossalai run --nproc_per_node 2 --hostfile hostfile pretrain.py --d "/aml/data/boyang/RedPajama-Data-1T-Sample" -p "hybrid_parallel" -c "7b" -b 2 -l 2048
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=A100 --master_port=29500 pretrain.py --d /aml/data/boyang/RedPajama-Data-1T-Sample -p hybrid_parallel -c 7b -b 2 -l 2048 on A100, is localhost: False, exception: No authentication methods available
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/aml2/colo/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
/aml2/colo/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
Epoch 0: 0%| | 0/116314 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 331, in <module>
main()
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 289, in main
outputs = booster.execute_pipeline(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/booster.py", line 205, in execute_pipeline
return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 832, in execute_pipeline
outputs = self.schedule.forward_backward_step(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 288, in forward_backward_step
output_obj = self.forward_step(model, input_obj, criterion, accum_loss, outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 164, in forward_step
output_obj = model_forward(model, micro_batch, input_obj)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py", line 120, in model_forward
return model(**data, **internal_inputs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 118, in forward
return super().forward(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/interface/model.py", line 25, in forward
return self.module(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 233, in llama_for_causal_lm_forward
outputs = LlamaPipelineForwards.llama_model_forward(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 100, in llama_model_forward
attention_mask = self._prepare_decoder_attention_mask(
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LlamaModel' object has no attribute '_prepare_decoder_attention_mask'
Epoch 0: 0%| | 0/116314 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 331, in <module>
main()
File "/aml2/ColossalAI/examples/language/llama2/pretrain.py", line 289, in main
outputs = booster.execute_pipeline(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/booster.py", line 205, in execute_pipeline
return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 832, in execute_pipeline
outputs = self.schedule.forward_backward_step(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 288, in forward_backward_step
output_obj = self.forward_step(model, input_obj, criterion, accum_loss, outputs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 164, in forward_step
output_obj = model_forward(model, micro_batch, input_obj)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py", line 120, in model_forward
return model(**data, **internal_inputs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 118, in forward
return super().forward(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/interface/model.py", line 25, in forward
return self.module(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 233, in llama_for_causal_lm_forward
outputs = LlamaPipelineForwards.llama_model_forward(
File "/aml2/colo/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 100, in llama_model_forward
attention_mask = self._prepare_decoder_attention_mask(
File "/aml2/colo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LlamaModel' object has no attribute '_prepare_decoder_attention_mask'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4959) of binary: /aml2/colo/bin/python
Traceback (most recent call last):
File "/aml2/colo/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/aml2/colo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
pretrain.py FAILED
Failures:
[1]:
time : 2023-11-12_05:22:03
host : A200
rank : 3 (local_rank: 1)
exitcode : 1 (pid: 4960)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-11-12_05:22:03
host : A200
rank : 2 (local_rank: 0)
exitcode : 1 (pid: 4959)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=A100 --master_port=29500 pretrain.py --d /aml/data/boyang/RedPajama-Data-1T-Sample -p hybrid_parallel -c 7b -b 2 -l 2048 on A200, is localhost: True, exception: Encountered a bad command exit code!
Environment
Ubuntu 18.04
2 nodes × 2 A100 80GB GPUs

torch 1.13.1
CUDA 11.7
ColossalAI: latest
BTW: TP and Gemini work fine, but PP does not work as designed.
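For context, the AttributeError is raised by torch.nn.Module's `__getattr__` fallback (module.py line 1269 in the traceback), which fires only when a name is not found as a regular attribute, parameter, buffer, or submodule. That suggests the installed transformers release no longer defines `LlamaModel._prepare_decoder_attention_mask`, while ColossalAI's shardformer pipeline forward still calls it. A minimal sketch of the failure mode, using a hypothetical `MiniModule` in place of `nn.Module` (plain Python, no torch required):

```python
# MiniModule is a hypothetical stand-in that mimics torch.nn.Module's
# __getattr__ fallback: when normal attribute lookup fails, Python calls
# __getattr__, which raises AttributeError with this exact message format.
class MiniModule:
    def __getattr__(self, name):
        # Mirrors the raise in torch/nn/modules/module.py
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name)
        )

m = MiniModule()
try:
    # The shardformer forward does the equivalent of this lookup on LlamaModel;
    # if the method was removed from the class, lookup falls through here.
    m._prepare_decoder_attention_mask
except AttributeError as e:
    print(e)
```

If this diagnosis is right, pinning transformers to the version ColossalAI's llama2 example was tested against (rather than the newest release) would be the first thing to try.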