Terminate called after throwing an instance of 'c10::Error'

FreemanG · July 21, 2019, 9:22am

I am using PyTorch 1.1 (Python 3.6, CUDA 10, Ubuntu 18.04).
I am testing with the official PyTorch ImageNet example.

I keep running into an unspecified launch failure:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fe33af7f441 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fe33af7ed7a in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x13652 (0x7fe33a51e652 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x50 (0x7fe33af6fce0 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x30facb (0x7fe2d6406acb in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: <unknown function> + 0x1420bb (0x7fe33b5080bb in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6be7c1 (0x7fe33ba847c1 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fe33ba84902 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0xa2 (0x7fe33b4e4bb2 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x6b375b (0x7fe33ba7975b in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12fbc7 (0x7fe33b4f5bc7 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x12fe2e (0x7fe33b4f5e2e in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
...

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/share/imagenet/main.py", line 238, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/share/imagenet/main.py", line 301, in train
    progress.display(i)
  File "/share/imagenet/main.py", line 386, in display
    entries += [str(meter) for meter in self.meters]
  File "/share/imagenet/main.py", line 386, in <listcomp>
    entries += [str(meter) for meter in self.meters]
  File "/share/imagenet/main.py", line 375, in __str__
    return fmtstr.format(**self.__dict__)
  File "/home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/tensor.py", line 386, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: unspecified launch failure

sherylwang · September 20, 2019, 3:33am

Have you solved the problem? I ran into the same error recently.

FreemanG · September 20, 2019, 8:50am

Actually, I am not sure how I solved this. After I reinstalled PyTorch, CUDA/cuDNN, and the NVIDIA driver, the problem went away.

Also, this could be a hardware problem, since it occurred on a flawed workstation (multi-GPU training might make it crash).
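
For anyone trying to narrow this down: CUDA kernels launch asynchronously, so the Python traceback in the first post (ending at `self.item()`) probably shows where the error was noticed, not necessarily the kernel that failed. A minimal sketch, assuming you relaunch the training script with the `CUDA_LAUNCH_BLOCKING=1` environment variable so errors surface closer to the failing call. The helper names `blocking_env` and `run_blocking` are hypothetical, made up for illustration:

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    CUDA_LAUNCH_BLOCKING=1 makes each kernel launch wait for completion,
    so an error is reported near the call that triggered it.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` (a list of arguments) in a child process with that environment.

    Hypothetical helper: `cmd` is whatever command you normally use
    to start training.
    """
    return subprocess.run(cmd, env=blocking_env(), check=False)
```

You would call `run_blocking` with your usual launch command as a list of arguments. Training will be slower in this mode, so it only makes sense for debugging. It will not fix a hardware fault, but it may show whether the failure always comes from the same operation.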

Ziyu_Huang · February 14, 2022, 2:42pm

I hit the same problem when running a 2-GPU job on my school's server… But when I reran it, the error disappeared! Maybe this is just an occasional hardware communication problem…