I am using PyTorch 1.1 (Python 3.6, CUDA 10, Ubuntu 18.04).
I am testing with the official PyTorch ImageNet example (main.py).
Training keeps crashing with an unspecified launch failure:
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fe33af7f441 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fe33af7ed7a in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x13652 (0x7fe33a51e652 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x50 (0x7fe33af6fce0 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x30facb (0x7fe2d6406acb in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: <unknown function> + 0x1420bb (0x7fe33b5080bb in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6be7c1 (0x7fe33ba847c1 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fe33ba84902 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0xa2 (0x7fe33b4e4bb2 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x6b375b (0x7fe33ba7975b in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12fbc7 (0x7fe33b4f5bc7 in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x12fe2e (0x7fe33b4f5e2e in /home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
...
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/share/imagenet/main.py", line 238, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/share/imagenet/main.py", line 301, in train
    progress.display(i)
  File "/share/imagenet/main.py", line 386, in display
    entries += [str(meter) for meter in self.meters]
  File "/share/imagenet/main.py", line 386, in <listcomp>
    entries += [str(meter) for meter in self.meters]
  File "/share/imagenet/main.py", line 375, in __str__
    return fmtstr.format(**self.__dict__)
  File "/home/anaconda3/envs/pt1.1/lib/python3.6/site-packages/torch/tensor.py", line 386, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: unspecified launch failure