Closed as not planned
Describe the bug
I can't run Neva pre-training with pipeline parallelism set to 2 (`model.pipeline_model_parallel_size=2`). Tensor parallelism works well.
I got the following error when I ran this bash script:
torchrun --nproc_per_node=2 --rdzv-endpoint=localhost:29500 $HOME/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
++cluster_type=BCP \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=[0,1] \
trainer.val_check_interval=50 \
trainer.limit_val_batches=5 \
trainer.log_every_n_steps=1 \
trainer.max_steps=50 \
model.megatron_amp_O2=True \
model.micro_batch_size=1 \
model.global_batch_size=4 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=2 \
model.mcore_gpt=True \
model.transformer_engine=True \
model.data.data_path=$DATA_DIR/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json \
model.data.image_folder=$DATA_DIR/datasets/LLaVA-Pretrain-LCS-558K/images \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=$DATA_DIR/tokenizers/tokenizer_neva.model \
model.encoder_seq_length=4096 \
model.num_layers=32 \
model.hidden_size=4096 \
model.ffn_hidden_size=11008 \
model.num_attention_heads=32 \
model.normalization=rmsnorm \
model.do_layer_norm_weight_decay=False \
model.apply_query_key_layer_scaling=True \
model.bias=False \
model.activation=fast-swiglu \
model.headscale=False \
model.position_embedding_type=rope \
model.rotary_percentage=1.0 \
model.num_query_groups=null \
model.data.num_workers=31 \
model.mm_cfg.llm.from_pretrained=$DATA_DIR/checkpoints/llama-2-7b-chat.nemo \
model.mm_cfg.llm.model_type=v1 \
model.data.conv_template=v1 \
model.mm_cfg.vision_encoder.from_pretrained='openai/clip-vit-large-patch14' \
model.mm_cfg.vision_encoder.from_hf=True \
model.optim.name="fused_adam" \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.create_wandb_logger=True \
exp_manager.resume_if_exists=False
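As a sanity check on the settings above, here is a minimal sketch of the usual Megatron-style layout arithmetic: data-parallel size = world size / (tensor parallel × pipeline parallel), and micro-batches per step = global batch size / (micro batch size × data-parallel size). The helper name `parallel_layout` is hypothetical and is not part of NeMo. It only checks that the numbers divide cleanly.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel,
                    micro_batch_size, global_batch_size):
    """Return (data_parallel_size, num_micro_batches) or raise if inconsistent.

    Hypothetical helper for illustration; it mirrors the common
    Megatron-style relations, not any specific NeMo function.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel

    samples_per_micro_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError("global batch size must be divisible by micro batch * DP")
    return data_parallel, global_batch_size // samples_per_micro_step


# Values taken from the command above: 2 GPUs, TP=1, PP=2, micro=1, global=4.
dp, n_micro = parallel_layout(2, 1, 2, 1, 4)
print(dp, n_micro)
```

With these values the sizes divide evenly (one data-parallel replica, four micro-batches per step), so the batch-size arithmetic itself does not appear to be inconsistent.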
The error output is shown below.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: <unknown function> + 0x13ee27 (0x5575903cfe27 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #24: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #26: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #42: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #44: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #46: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #48: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #49: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #50: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #52: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #58: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #59: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #60: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #62: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: PyType_GenericAlloc + 0x33a (0x5575903b8bfa in /usr/bin/python)
frame #20: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x75a0 (0x5575903db150 in /usr/bin/python)
frame #22: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #26: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #42: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #44: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #46: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #47: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #48: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #50: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #52: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #56: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #57: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #58: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #60: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #62: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: <unknown function> + 0x168614 (0x5575903f9614 in /usr/bin/python)
frame #20: _PyObject_GenericGetAttrWithDict + 0x1cb (0x5575903e960b in /usr/bin/python)
frame #21: PyObject_GetAttr + 0x4d (0x5575903e7e3d in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x5dc1 (0x5575903d9971 in /usr/bin/python)
frame #23: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #33: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #37: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #38: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #39: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #41: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #43: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #45: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #47: _PyObject_FastCallDictTstate + 0xc4 (0x5575903e0c14 in /usr/bin/python)
frame #48: <unknown function> + 0x164a64 (0x5575903f5a64 in /usr/bin/python)
frame #49: _PyObject_MakeTpCall + 0x1fc (0x5575903e1a1c in /usr/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x64e6 (0x5575903da096 in /usr/bin/python)
frame #51: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #52: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #53: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)
frame #54: <unknown function> + 0x286c5b (0x557590517c5b in /usr/bin/python)
frame #55: PyObject_GetIter + 0x18 (0x5575903c5a18 in /usr/bin/python)
frame #56: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #60: <unknown function> + 0x1687f1 (0x5575903f97f1 in /usr/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #62: <unknown function> + 0x1a54a3 (0x5575904364a3 in /usr/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: PyType_GenericAlloc + 0x33a (0x5575903b8bfa in /usr/bin/python)
frame #20: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #21: <unknown function> + 0x152771 (0x5575903e3771 in /usr/bin/python)
frame #22: _PyObject_MakeTpCall + 0x1a7 (0x5575903e19c7 in /usr/bin/python)
frame #23: _PyObject_FastCallDictTstate + 0x253 (0x5575903e0da3 in /usr/bin/python)
frame #24: <unknown function> + 0x14de13 (0x5575903dee13 in /usr/bin/python)
frame #25: <unknown function> + 0x14d99b (0x5575903de99b in /usr/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #27: <unknown function> + 0x13f9c6 (0x5575903d09c6 in /usr/bin/python)
frame #28: PyEval_EvalCode + 0x86 (0x5575904c6256 in /usr/bin/python)
frame #29: <unknown function> + 0x23ae2d (0x5575904cbe2d in /usr/bin/python)
frame #30: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #32: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #34: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #36: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #38: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #40: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #41: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #42: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #43: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #44: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #45: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #46: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #48: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #50: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #51: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #52: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #53: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x85a5 (0x5575903dc155 in /usr/bin/python)
frame #55: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #56: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #57: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #59: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #61: <unknown function> + 0x1c2afe (0x557590453afe in /usr/bin/python)
frame #62: <unknown function> + 0x1c292e (0x55759045392e in /usr/bin/python)
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: PyTuple_New + 0xff (0x5575903b4e8f in /usr/bin/python)
frame #21: <unknown function> + 0x1393bf (0x5575903ca3bf in /usr/bin/python)
frame #22: <unknown function> + 0x138ff8 (0x5575903c9ff8 in /usr/bin/python)
frame #23: <unknown function> + 0x13951f (0x5575903ca51f in /usr/bin/python)
frame #24: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #25: <unknown function> + 0x234880 (0x5575904c5880 in /usr/bin/python)
frame #26: <unknown function> + 0x244895 (0x5575904d5895 in /usr/bin/python)
frame #27: <unknown function> + 0x159b34 (0x5575903eab34 in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #33: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #39: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #40: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #41: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #42: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #43: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #44: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #45: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #50: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #51: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #52: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x85a5 (0x5575903dc155 in /usr/bin/python)
frame #54: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #55: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #56: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #57: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #58: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #60: <unknown function> + 0x1c2afe (0x557590453afe in /usr/bin/python)
frame #61: <unknown function> + 0x1c292e (0x55759045392e in /usr/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0xbfe (0x5575903d47ae in /usr/bin/python)
what(): CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f7e0a0294b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f7e09fd7842 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f7e5c498802 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1a4be (0x7f7e5c4624be in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1cc4e (0x7f7e5c464c4e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3ba (0x7f7e5c4653ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5150e0 (0x7f7e5b31d0e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7e0a0063f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7cd508 (0x7f7e5b5d5508 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7f7e5b5d5826 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x126107 (0x5575903b7107 in /usr/bin/python)
frame #11: <unknown function> + 0x150660 (0x5575903e1660 in /usr/bin/python)
frame #12: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #13: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #14: <unknown function> + 0x1507a0 (0x5575903e17a0 in /usr/bin/python)
frame #15: <unknown function> + 0x164518 (0x5575903f5518 in /usr/bin/python)
frame #16: <unknown function> + 0x12c4f1 (0x5575903bd4f1 in /usr/bin/python)
frame #17: <unknown function> + 0x129ebc (0x5575903baebc in /usr/bin/python)
frame #18: <unknown function> + 0x231470 (0x5575904c2470 in /usr/bin/python)
frame #19: _PyObject_GC_NewVar + 0x289 (0x5575903ad659 in /usr/bin/python)
frame #20: PyTuple_New + 0xff (0x5575903b4e8f in /usr/bin/python)
frame #21: <unknown function> + 0x1393bf (0x5575903ca3bf in /usr/bin/python)
frame #22: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #23: <unknown function> + 0x13951f (0x5575903ca51f in /usr/bin/python)
frame #24: <unknown function> + 0x138fcc (0x5575903c9fcc in /usr/bin/python)
frame #25: <unknown function> + 0x234880 (0x5575904c5880 in /usr/bin/python)
frame #26: <unknown function> + 0x244895 (0x5575904d5895 in /usr/bin/python)
frame #27: <unknown function> + 0x159b34 (0x5575903eab34 in /usr/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x198c (0x5575903d553c in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #33: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
frame #35: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #37: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #39: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #40: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #41: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #42: PyImport_ImportModuleLevelObject + 0x4b3 (0x5575903ff7e3 in /usr/bin/python)
frame #43: <unknown function> + 0x17eb68 (0x55759040fb68 in /usr/bin/python)
frame #44: <unknown function> + 0x15a10e (0x5575903eb10e in /usr/bin/python)
frame #45: PyObject_Call + 0xbb (0x5575903fa42b in /usr/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #47: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x6bd (0x5575903d426d in /usr/bin/python)
frame #49: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #50: <unknown function> + 0x159dc9 (0x5575903eadc9 in /usr/bin/python)
frame #51: _PyObject_CallMethodIdObjArgs + 0xff (0x5575904cb7cf in /usr/bin/python)
frame #52: PyImport_ImportModuleLevelObject + 0xd49 (0x557590400079 in /usr/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x32f3 (0x5575903d6ea3 in /usr/bin/python)
frame #54: <unknown function> + 0x13f9c6 (0x5575903d09c6 in /usr/bin/python)
frame #55: PyEval_EvalCode + 0x86 (0x5575904c6256 in /usr/bin/python)
frame #56: <unknown function> + 0x23ae2d (0x5575904cbe2d in /usr/bin/python)
frame #57: <unknown function> + 0x15ac59 (0x5575903ebc59 in /usr/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x2a27 (0x5575903d65d7 in /usr/bin/python)
frame #59: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x614a (0x5575903d9cfa in /usr/bin/python)
frame #61: _PyFunction_Vectorcall + 0x7c (0x5575903eb9fc in /usr/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x8ac (0x5575903d445c in /usr/bin/python)
Epoch 0: : 0%| | 0/50 [00:00<?]Error executing job with overrides: ['++cluster_type=BCP', 'trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=[0,1]', 'trainer.val_check_interval=50', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=50', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=4', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=2', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data/unagi0/sakamoto/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data/unagi0/sakamoto/neva/datasets/LLaVA-Pretrain-LCS-558K/images', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/data/unagi0/sakamoto/neva/tokenizers/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=11008', 'model.num_attention_heads=32', 'model.normalization=rmsnorm', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.bias=False', 'model.activation=fast-swiglu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=1.0', 'model.num_query_groups=null', 'model.data.num_workers=31', 'model.mm_cfg.llm.from_pretrained=/data/unagi0/sakamoto/neva/checkpoints/llama-2-7b-chat.nemo', 'model.mm_cfg.llm.model_type=v1', 'model.data.conv_template=v1', 'model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.create_wandb_logger=True', 'exp_manager.resume_if_exists=False']
[rank1]: Traceback (most recent call last):
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1130, in _try_get_data
[rank1]: data = self._data_queue.get(timeout=timeout)
[rank1]: File "/usr/lib/python3.10/queue.py", line 180, in get
[rank1]: self.not_empty.wait(remaining)
[rank1]: File "/usr/lib/python3.10/threading.py", line 324, in wait
[rank1]: gotit = waiter.acquire(True, timeout)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
[rank1]: _error_if_any_worker_fails()
[rank1]: RuntimeError: DataLoader worker (pid 1008552) is killed by signal: Aborted.
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 40, in <module>
[rank1]: main()
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank1]: _run_hydra(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank1]: _run_app(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank1]: run_and_report(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank1]: raise ex
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank1]: return func()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank1]: lambda: hydra.run(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank1]: _ = ret.return_value
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank1]: raise self._return_value
[rank1]: File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank1]: ret.return_value = task_function(task_cfg)
[rank1]: File "/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 35, in main
[rank1]: trainer.fit(model)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
[rank1]: results = self._run_stage()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
[rank1]: self.advance()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
[rank1]: self.epoch_loop.run(self._data_fetcher)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
[rank1]: self.advance(data_fetcher)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 320, in advance
[rank1]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 192, in run
[rank1]: self._optimizer_step(batch_idx, closure)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 270, in _optimizer_step
[rank1]: call._call_lightning_module_hook(
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 171, in _call_lightning_module_hook
[rank1]: output = fn(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1312, in optimizer_step
[rank1]: super().optimizer_step(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1302, in optimizer_step
[rank1]: optimizer.step(closure=optimizer_closure)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 154, in step
[rank1]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank1]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
[rank1]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1699, in optimizer_step
[rank1]: _ = closure()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 146, in __call__
[rank1]: self._result = self.closure(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 131, in closure
[rank1]: step_output = self._step_fn()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 319, in _training_step
[rank1]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 323, in _call_strategy_hook
[rank1]: output = fn(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 391, in training_step
[rank1]: return self.lightning_module.training_step(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/utils/model_utils.py", line 463, in wrap_training_step
[rank1]: output_dict = wrapped(*args, **kwargs)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 1019, in training_step
[rank1]: return MegatronGPTModel.training_step(self, dataloader_iter)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 897, in training_step
[rank1]: loss_mean = self.training_step_fwd_bwd_step_call(dataloader_iter, forward_only=False)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 817, in training_step_fwd_bwd_step_call
[rank1]: loss_mean = self.fwd_bwd_step(dataloader_iter, forward_only)
[rank1]: File "/home/mil/sakamoto/NeMo/nemo/collections/multimodal/models/multimodal_llm/neva/neva_model.py", line 945, in fwd_bwd_step
[rank1]: batch, _, _ = next(dataloader_iter)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 208, in __next__
[rank1]: batch, batch_idx, dataloader_idx = super(_DataLoaderIterDataFetcher, fetcher).__next__()
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 61, in __next__
[rank1]: batch = next(self.iterator)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
[rank1]: out = next(self._iterator)
[rank1]: File "/home/mil/sakamoto/.local/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
[rank1]: out[i] = next(self.iterators[i])
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 629, in __next__
[rank1]: data = self._next_data()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
[rank1]: idx, data = self._get_data()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
[rank1]: success, data = self._try_get_data()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1143, in _try_get_data
[rank1]: raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
[rank1]: RuntimeError: DataLoader worker (pid(s) 1008552, 1008556, 1008559, 1008562) exited unexpectedly
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009421 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008548 [1] NCCL INFO misc/socket.cc:668 -> 3
kasago:1001461:1008548 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009421 [1] NCCL INFO comm 0x5575c86fff00 rank 0 nranks 1 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009570 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008501 [1] NCCL INFO misc/socket.cc:668 -> 3
kasago:1001461:1008501 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009570 [1] NCCL INFO comm 0x5575a9fced00 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009571 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1008544 [1] NCCL INFO misc/socket.cc:668 -> 3
kasago:1001461:1008544 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009571 [1] NCCL INFO comm 0x5575bbed4480 rank 0 nranks 1 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009778 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1003249 [1] NCCL INFO misc/socket.cc:668 -> 3
kasago:1001461:1003249 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009778 [1] NCCL INFO comm 0x5575a76127e0 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:550 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:573 -> 3
kasago:1001461:1009852 [1] NCCL INFO misc/socket.cc:621 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:47 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:752 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:428 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:564 -> 3
kasago:1001461:1002702 [1] NCCL INFO misc/socket.cc:668 -> 3
kasago:1001461:1002702 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
kasago:1001461:1009852 [1] NCCL INFO comm 0x5575a516e7b0 rank 1 nranks 2 cudaDev 1 busId 2d000 - Abort COMPLETE
W0216 15:12:36.529000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:857] Sending process 1001460 closing signal SIGTERM
[rank0]:W0216 15:12:36.605000 140418210715200 torch/_inductor/compile_worker/subproc_pool.py:122] SubprocPool unclean exit
W0216 15:13:06.534000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:874] Unable to shutdown process 1001460 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0216 15:13:06.861000 140050749429568 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 1001461) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0a0+3bcc3cddb5.nv24.7', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/mil/sakamoto/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-16_15:12:36
host : kasago
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1001461)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Steps/Code to reproduce bug
I followed the tutorial for Neva training, using the command shown above.
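As a sanity check on the overrides listed in the log above, here is a minimal sketch of the usual Megatron-style layout arithmetic. It is not NeMo code; the function name `parallel_layout` and its parameters are hypothetical and only mirror the values passed on the command line (2 GPUs, tensor parallel 1, pipeline parallel 2, micro batch 1, global batch 4). It suggests the batch and parallel sizes themselves are consistent, so the crash is probably not a simple divisibility problem.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel,
                    micro_batch_size, global_batch_size):
    """Return (data_parallel_size, num_micro_batches) for a given layout.

    Hypothetical helper for illustration only; it is not part of NeMo.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel

    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro batch * DP")
    return data_parallel, global_batch_size // samples_per_pass


# Values taken from the overrides in this report.
print(parallel_layout(world_size=2, tensor_parallel=1, pipeline_parallel=2,
                      micro_batch_size=1, global_batch_size=4))
```

With these values the sketch reports a data-parallel size of 1 and 4 micro-batches per step, which is a valid pipeline-parallel configuration on two GPUs.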