Training works normally when I use a simple network as the backbone. When I switch to a more complex network, the following error is reported:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fe222fa3193 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x7fe2231e0f66 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x7fe2231e2cbd in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fe222f9363d in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #4: c10d::Reducer::~Reducer() + 0x449 (0x7fe2245b0b19 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fe22458e8f2 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fe223de8336 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x9f952b (0x7fe22458f52b in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2942d0 (0x7fe223e2a2d0 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x29555e (0x7fe223e2b55e in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x586885]
frame #11: /usr/bin/python() [0x56d0d5]
frame #12: /usr/bin/python() [0x4e9767]
frame #13: /usr/bin/python() [0x51b357]
frame #14: /usr/bin/python() [0x51b36d]
frame #15: /usr/bin/python() [0x5beb98]
frame #16: /usr/bin/python() [0x5bec2e]
frame #17: /usr/bin/python() [0x62ee33]
frame #18: PyEval_EvalFrameEx + 0x4f5f (0x53fcdf in /usr/bin/python)
frame #19: PyEval_EvalFrameEx + 0x49f4 (0x53f774 in /usr/bin/python)
frame #20: PyEval_EvalFrameEx + 0x49f4 (0x53f774 in /usr/bin/python)
frame #21: /usr/bin/python() [0x5441d9]
frame #22: PyEval_EvalFrameEx + 0x50de (0x53fe5e in /usr/bin/python)
frame #23: /usr/bin/python() [0x5441d9]
frame #24: PyEval_EvalCode + 0x1f (0x544eaf in /usr/bin/python)
frame #25: PyRun_StringFlags + 0x8f (0x57bd1f in /usr/bin/python)
frame #26: PyRun_SimpleStringFlags + 0x3c (0x6257ac in /usr/bin/python)
frame #27: Py_Main + 0x581 (0x63efe1 in /usr/bin/python)
frame #28: main + 0xe1 (0x4d13f1 in /usr/bin/python)
frame #29: __libc_start_main + 0xf0 (0x7fe22868c840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #30: _start + 0x29 (0x5d62d9 in /usr/bin/python)
Traceback (most recent call last):
File "train.py", line 367, in <module>
main()
File "train.py", line 44, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg, val_dataset))
File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/tensorflow-facenet/train.py", line 298, in main_worker
optimizer.step()
File "/usr/local/lib/python3.5/dist-packages/torch/optim/lr_scheduler.py", line 66, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/optim/sgd.py", line 100, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: CUDA error: unspecified launch failure
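
Since CUDA kernels launch asynchronously, the `optimizer.step()` line in the traceback may only be where the failure surfaced, not where it happened. One way I could try to narrow it down is to force synchronous launches with the `CUDA_LAUNCH_BLOCKING` environment variable. Below is a minimal sketch of that idea, not a fix: the helper name `enable_sync_cuda_errors` is hypothetical, and I am assuming it runs at the very top of `train.py`, before `torch` is imported and before `mp.spawn` starts the workers (child processes inherit the parent's environment).

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Hypothetical debugging helper (not part of PyTorch).

    Sets CUDA_LAUNCH_BLOCKING=1 so that each CUDA kernel launch blocks
    until it finishes. A failing kernel is then reported at the call
    that launched it rather than at a later, unrelated operation.
    Must run before CUDA is initialized in the process.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Assumption: this is the first thing executed in train.py.
enable_sync_cuda_errors()

# import torch                      # import torch only after the variable is set
# import torch.multiprocessing as mp
# mp.spawn(...)                     # spawned workers inherit the variable
```

Training will run slower with this setting, so it is only meant for a debugging run. The same effect can be had without code changes by launching with `CUDA_LAUNCH_BLOCKING=1 python train.py`.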