[BUG]: MoE example ProcessGroupNCCL cleanup error after training finishes and CUDA shuts down #1427

@nostalgicimp

Description

🐛 Describe the bug

I ran into an error when training the MoE example (https://github.com/hpcaitech/ColossalAI-Examples/tree/5b23e8cf22cf029b9ac77c2ed92bbc339e7fbd4e/image/moe). Every run, right after the last iteration finished, it threw the errors below while CUDA was shutting down (see the sketch after the traceback):

[Epoch 99 / Test]: 100%|███████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00, 5.12it/s, accuracy=0.88235, loss=0.847, throughput=3424.7 sample_per_sec]
[08/09/22 19:03:05] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.8/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 99 / Test]: Accuracy = 0.8823 | Loss = 0.79231 | Throughput = 3403.1
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f1817bb01bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f1855a3f6ea in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f1855a41cd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f1855a42f65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: + 0xc9039 (0x7f18ada80039 in /opt/conda/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: + 0x76db (0x7f18cdd4f6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x3f (0x7f18cda7871f in /lib/x86_64-linux-gnu/libc.so.6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 7730) of binary: /opt/conda/bin/python
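
For reference, the shutdown ordering looks roughly like the sketch below. This is not the actual example script: the example presumably initializes NCCL through colossalai.launch_from_torch, plain torch.distributed stands in for that here, and the explicit destroy_process_group() at the end is only a guessed-at mitigation, not a confirmed fix.

```python
# Minimal sketch (not the MoE example itself): the crash happens only at
# interpreter shutdown, after training and evaluation have already finished.
import torch
import torch.distributed as dist


def main():
    # the example sets up NCCL via colossalai.launch_from_torch (assumption);
    # plain torch.distributed stands in for that setup here
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # ... build model / dataloaders, run the 100 epochs of train + test ...

    # guessed-at mitigation: tear down the process group explicitly before the
    # CUDA driver shuts down, so ProcessGroupNCCL's workCleanupLoop is not
    # still polling CUDA events during interpreter exit
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```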

Environment

CUDA Version: 11.3
PyTorch Version: 1.11.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
Python: 3.8.12

colossalai 0.1.8+torch1.11cu11.3
energonai 0.0.1b0
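
The PyTorch/CUDA lines in the version block above can be re-checked with a few lines of Python. This sketch only covers the PyTorch-side fields, not the ColossalAI CUDA-extension check.

```python
# regenerate the PyTorch-side fields of the environment report
import sys

import torch

print(f"Python: {sys.version.split()[0]}")                     # e.g. 3.8.12
print(f"PyTorch Version: {torch.__version__}")                 # e.g. 1.11.0
print(f"CUDA Version in PyTorch Build: {torch.version.cuda}")  # e.g. 11.3
print(f"CUDA available: {torch.cuda.is_available()}")
```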

Labels: bug (Something isn't working)