Pre-training Neva fails with pipeline parallel size set to 2. #12205

@takuya576

Describe the bug

I can't run Neva pre-training with the pipeline parallel size set to 2 (tensor parallelism works fine).

I got the following error when I ran this bash script:

torchrun --nproc_per_node=2 --rdzv-endpoint=localhost:29500 $HOME/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
 ++cluster_type=BCP \
 trainer.precision=bf16 \
 trainer.num_nodes=1 \
 trainer.devices=[0,1] \
 trainer.val_check_interval=50 \
 trainer.limit_val_batches=5 \
 trainer.log_every_n_steps=1 \
 trainer.max_steps=50 \
 model.megatron_amp_O2=True \
 model.micro_batch_size=1 \
 model.global_batch_size=4 \
 model.tensor_model_parallel_size=1 \
 model.pipeline_model_parallel_size=2 \
 model.mcore_gpt=True \
 model.transformer_engine=True \
 model.data.data_path=$DATA_DIR/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json \
 model.data.image_folder=$DATA_DIR/datasets/LLaVA-Pretrain-LCS-558K/images \
 model.tokenizer.library=sentencepiece \
 model.tokenizer.model=$DATA_DIR/tokenizers/tokenizer_neva.model \
 model.encoder_seq_length=4096 \
 model.num_layers=32 \
 model.hidden_size=4096 \
 model.ffn_hidden_size=11008 \
 model.num_attention_heads=32 \
 model.normalization=rmsnorm \
 model.do_layer_norm_weight_decay=False \
 model.apply_query_key_layer_scaling=True \
 model.bias=False \
 model.activation=fast-swiglu \
 model.headscale=False \
 model.position_embedding_type=rope \
 model.rotary_percentage=1.0 \
 model.num_query_groups=null \
 model.data.num_workers=31 \
 model.mm_cfg.llm.from_pretrained=$DATA_DIR/checkpoints/llama-2-7b-chat.nemo \
 model.mm_cfg.llm.model_type=v1 \
 model.data.conv_template=v1 \
 model.mm_cfg.vision_encoder.from_pretrained='openai/clip-vit-large-patch14' \
 model.mm_cfg.vision_encoder.from_hf=True \
 model.optim.name="fused_adam" \
 exp_manager.create_checkpoint_callback=True \
 exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
 exp_manager.create_wandb_logger=True \
 exp_manager.resume_if_exists=False
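
As a sanity check on the settings above, here is a minimal sketch of the usual Megatron-style arithmetic that relates the world size, the tensor and pipeline parallel sizes, and the batch sizes. The helper name `parallel_layout` is hypothetical and not part of NeMo; it only illustrates that the values in the command (2 processes, tensor parallel 1, pipeline parallel 2, micro batch 1, global batch 4) are mutually consistent, assuming the common rule that the global batch size must be divisible by the micro batch size times the data parallel size.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel,
                    micro_batch_size, global_batch_size):
    """Derive the data parallel size and micro-batches per step.

    Hypothetical helper for illustration only; not a NeMo API.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel

    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by micro batch size * DP"
        )
    return {
        "data_parallel": data_parallel,
        "num_microbatches": global_batch_size // samples_per_pass,
    }


# Values taken from the command above: --nproc_per_node=2, TP=1, PP=2,
# micro_batch_size=1, global_batch_size=4.
layout = parallel_layout(2, 1, 2, 1, 4)
print(layout)  # {'data_parallel': 1, 'num_microbatches': 4}
```

Under these assumptions the configuration yields a data parallel size of 1 and 4 micro-batches per step, which is the same layout as the working tensor parallel run (TP=2, PP=1), so the sizes themselves do not look invalid.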

The error output is shown below.

terminate called after throwing an instance of 'c10::Error'                                                                             
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: <unknown function> + 0x13ee27 (0x5575903cfe27 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #24: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #26: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #42: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #44: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #46: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #48: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #49: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #50: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #52: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #58: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #59: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #60: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #62: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: PyType_GenericAlloc + 0x33a (0x5575903b8bfa in /usr/bin/python)
frame #20: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x75a0 (0x5575903db150 in /usr/bin/python)
frame #22: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #26: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #42: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #44: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #46: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #47: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #48: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #50: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #52: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #56: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #57: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #58: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #60: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #62: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: <unknown function> + 0x168614 (0x5575903f9614 in /usr/bin/python)
frame #20: _PyObject_GenericGetAttrWithDict + 0x1cb (0x5575903e960b in /usr/bin/python)
frame #21: PyObject_GetAttr + 0x4d (0x5575903e7e3d in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x5dc1 (0x5575903d9971 in /usr/bin/python)
frame #23: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #33: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #37: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #38: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #39: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #41: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #43: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #45: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #47: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #48: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #49: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #51: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #52: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #53: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)
frame #54: <unknown function> + 0x286c5b (0x557590517c5b in /usr/bin/python)
frame #55: PyObject_GetIter + 0x18 (0x5575903c5a18 in /usr/bin/python)
frame #56: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #60: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #62: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: PyType_GenericAlloc + 0x33a (0x5575903b8bfa in /usr/bin/python)
frame #20: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #21: <unknown function> + 0x152771 (0x5575903e3771 in /usr/bin/python)
frame #22: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #23: _PyObject_FastCallDictTstate + 0x253 (0x5575903e0da3 in /usr/bin/python)
frame #24: <unknown function> + 0x14de13 (0x5575903dee13 in /usr/bin/python)
frame #25: <unknown function> + 0x14d99b (0x5575903de99b in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #27: <unknown function> + 0x13f9c6 (0x5575903d09c6 in /usr/bin/python)
frame #28: PyEval_EvalCode + 0x86 (0x5575904c6256 in /usr/bin/python)
frame #29: <unknown function> + 0x23ae2d (0x5575904cbe2d in /usr/bin/python)
frame #30: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #42: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #43: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #44: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #45: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #46: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #48: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #50: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #51: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #52: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #53: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x85a5 (0x5575903dc155 in /usr/bin/python)
frame #55: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #56: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #57: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #59: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #61: <unknown function> + 0x1c2afe (0x557590453afe in /usr/bin/python)
frame #62: <unknown function> + 0x1c292e (0x55759045392e in /usr/bin/python)

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: PyTuple_New + 0xff (0x5575903b4e8f in /usr/bin/python)
frame #21: <unknown function> + 0x1393bf (0x5575903ca3bf in /usr/bin/python)
frame #22: <unknown function> + 0x138ff8 (0x5575903c9ff8 in /usr/bin/python)
frame #23: <unknown function> + 0x13951f (0x5575903ca51f in /usr/bin/python)
frame #24: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #25: <unknown function> + 0x234880 (0x5575904c5880 in /usr/bin/python)
frame #26: <unknown function> + 0x244895 (0x5575904d5895 in /usr/bin/python)
frame #27: <unknown function> + 0x159b34 (0x5575903eab34 in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #33: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #39: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #40: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #41: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #42: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #43: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #44: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #45: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #50: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #51: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #52: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x85a5 (0x5575903dc155 in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #60: <unknown function> + 0x1c2afe (0x557590453afe in /usr/bin/python)
frame #61: <unknown function> + 0x1c292e (0x55759045392e in /usr/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0xbfe (0x5575903d47ae in /usr/bin/python)

  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: PyTuple_New + 0xff (0x5575903b4e8f in /usr/bin/python)
frame #21: <unknown function> + 0x1393bf (0x5575903ca3bf in /usr/bin/python)
frame #22: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #23: <unknown function> + 0x13951f (0x5575903ca51f in /usr/bin/python)
frame #24: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #25: <unknown function> + 0x234880 (0x5575904c5880 in /usr/bin/python)
frame #26: <unknown function> + 0x244895 (0x5575904d5895 in /usr/bin/python)
frame #27: <unknown function> + 0x159b34 (0x5575903eab34 in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #33: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #39: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #40: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #41: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #42: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #43: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #44: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #45: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #50: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #51: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #52: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x32f3 (0x5575903d6ea3 in /usr/bin/python)
frame #54: <unknown function> + 0x13f9c6 (0x5575903d09c6 in /usr/bin/python)
frame #55: PyEval_EvalCode + 0x86 (0x5575904c6256 in /usr/bin/python)
frame #56: <unknown function> + 0x23ae2d (0x5575904cbe2d in /usr/bin/python)
frame #57: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #59: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #61: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)

Epoch 0: :   0%|                                                                                                        | 0/50 [00:00<?]Error executing job with overrides: ['++cluster_type=BCP', 'trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=[0,1]', 'trainer.val_check_interval=50', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=50', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=4', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=2', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data/unagi0/sakamoto/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data/unagi0/sakamoto/neva/datasets/LLaVA-Pretrain-LCS-558K/images', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/data/unagi0/sakamoto/neva/tokenizers/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=11008', 'model.num_attention_heads=32', 'model.normalization=rmsnorm', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.bias=False', 'model.activation=fast-swiglu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=1.0', 'model.num_query_groups=null', 'model.data.num_workers=31', 'model.mm_cfg.llm.from_pretrained=/data/unagi0/sakamoto/neva/checkpoints/llama-2-7b-chat.nemo', 'model.mm_cfg.llm.model_type=v1', 'model.data.conv_template=v1', 'model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.create_wandb_logger=True', 'exp_manager.resume_if_exists=False']
[rank1]: Traceback (most recent call last):
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1130, in _try_get_data
[rank1]:     data = self._data_queue.get(timeout=timeout)
[rank1]:   File "/usr/lib/python3.10/queue.py", line 180, in get
[rank1]:     self.not_empty.wait(remaining)
[rank1]:   File "/usr/lib/python3.10/threading.py", line 324, in wait
[rank1]:     gotit = waiter.acquire(True, timeout)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank1]:     _error_if_any_worker_fails()
[rank1]: RuntimeError: DataLoader worker (pid 1008552) is killed by signal: Aborted. 

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 40, in <module>
[rank1]:     main()
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank1]:     _run_hydra(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank1]:     _run_app(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank1]:     run_and_report(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank1]:     raise ex
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank1]:     return func()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank1]:     lambda: hydra.run(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank1]:     _ = ret.return_value
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank1]:     raise self._return_value
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank1]:     ret.return_value = task_function(task_cfg)
[rank1]:   File "/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 35, in main
[rank1]:     trainer.fit(model)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
[rank1]:     self.fit_loop.run()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
[rank1]:     self.advance()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
[rank1]:     self.epoch_loop.run(self._data_fetcher)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
[rank1]:     self.advance(data_fetcher)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 320, in advance
[rank1]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 192, in run
[rank1]:     self._optimizer_step(batch_idx, closure)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 270, in _optimizer_step
[rank1]:     call._call_lightning_module_hook(
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 171, in _call_lightning_module_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1312, in optimizer_step
[rank1]:     super().optimizer_step(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1302, in optimizer_step
[rank1]:     optimizer.step(closure=optimizer_closure)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 154, in step
[rank1]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank1]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
[rank1]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1699, in optimizer_step
[rank1]:     _ = closure()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 146, in __call__
[rank1]:     self._result = self.closure(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 131, in closure
[rank1]:     step_output = self._step_fn()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 319, in _training_step
[rank1]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 323, in _call_strategy_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 391, in training_step
[rank1]:     return self.lightning_module.training_step(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/utils/model_utils.py", line 463, in wrap_training_step
[rank1]:     output_dict = wrapped(*args, **kwargs)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 1019, in training_step
[rank1]:     return MegatronGPTModel.training_step(self, dataloader_iter)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 897, in training_step
[rank1]:     loss_mean = self.training_step_fwd_bwd_step_call(dataloader_iter, forward_only=False)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 817, in training_step_fwd_bwd_step_call
[rank1]:     loss_mean = self.fwd_bwd_step(dataloader_iter, forward_only)
[rank1]:   File "/home/mil/sakamoto/NeMo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 945, in fwd_bwd_step
[rank1]:     batch, _, _ = next(dataloader_iter)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 208, in __next__
[rank1]:     batch, batch_idx, dataloader_idx = super(_DataLoaderIterDataFetcher, fetcher).__next__()
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 61, in __next__
[rank1]:     batch = next(self.iterator)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
[rank1]:     out = next(self._iterator)
[rank1]:   File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
[rank1]:     out[i] = next(self.iterators[i])
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 629, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
[rank1]:     idx, data = self._get_data()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
[rank1]:     success, data = self._try_get_data()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1143, in _try_get_data
[rank1]:     raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
[rank1]: RuntimeError: DataLoader worker (pid(s) 1008552, 1008556, 1008559, 1008562) exited unexpectedly
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:668 -> 3

kasago:1001461:1008548 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009421 [1] NCCL INFO comm 0x5575c86fff00 rank 0 nranks 1 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:668 -> 3

kasago:1001461:1008501 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009570 [1] NCCL INFO comm 0x5575a9fced00 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:668 -> 3

kasago:1001461:1008544 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009571 [1] NCCL INFO comm 0x5575bbed4480 rank 0 nranks 1 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:668 -> 3

kasago:1001461:1003249 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009778 [1] NCCL INFO comm 0x5575a76127e0 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:668 -> 3

kasago:1001461:1002702 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009852 [1] NCCL INFO comm 0x5575a516e7b0 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
W0216 15:12:36.529000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:857] Sending process 1001460 closing signal SIGTERM
[rank0]:W0216 15:12:36.605000 140418210715200 torch/_inductor/compile_worker/subproc_pool.py:122] SubprocPool unclean exit
W0216 15:13:06.534000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:874] Unable to shutdown process 1001460 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0216 15:13:06.861000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 1001461) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+3bcc3cddb5.nv24.7', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-16_15:12:36
  host      : kasago
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1001461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Steps/Code to reproduce bug

I just followed the tutorial for Neva training.
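
The log above is long, and most of it is interpreter frames. The lines that carry the failure are the `what():` line from the aborted DataLoader worker and the `RuntimeError` lines reported on rank 1. As a rough sketch for pulling those out of a saved copy of the output: the helper name `summarize_failure` and the two patterns are assumptions chosen from this one log, not part of NeMo or PyTorch.

```python
import re

# Patterns for the lines that carry the failure in the log above.
# Both are assumptions derived from this single log; adjust as needed.
KEY_PATTERNS = [
    re.compile(r"what\(\):\s*(.+)"),    # C++ exception message from the aborted worker
    re.compile(r"RuntimeError: (.+)"),  # Python-side errors reported per rank
]


def summarize_failure(log_text):
    """Return the distinct failure lines, in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        for pattern in KEY_PATTERNS:
            match = pattern.search(line)
            if match:
                text = match.group(0).strip()
                if text not in seen:
                    seen.append(text)
                break  # one hit per line is enough
    return seen


if __name__ == "__main__":
    # Three lines copied from the log above, used here as sample input.
    sample = (
        "  what():  CUDA error: initialization error\n"
        "[rank1]: RuntimeError: DataLoader worker (pid 1008552) is killed by signal: Aborted. \n"
        "[rank1]: RuntimeError: DataLoader worker (pid(s) 1008552, 1008556, 1008559, 1008562) exited unexpectedly\n"
    )
    for item in summarize_failure(sample):
        print(item)
```

On this log, the first match is the `CUDA error: initialization error` raised inside a DataLoader worker, followed by the rank 1 errors saying the worker processes were aborted.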

Labels

bug (Something isn't working), stale
