Multi-GPU, segmentation fault

I ran into the error below after a few epochs of multi-GPU training. Sometimes the error is “Segmentation fault (core dumped)” instead.

My PyTorch version is 1.4.0 and my CUDA version is 10.1.
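
For anyone comparing setups, version details like these can be read directly from Python. The snippet below is only a minimal sketch: the helper name `describe_environment` is made up for illustration, and it assumes nothing beyond `torch` being importable (it reports `None` if it is not).

```python
import importlib


def describe_environment():
    """Collect basic version info for a bug report.

    `describe_environment` is a hypothetical helper name, not a PyTorch API.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch": None}
    return {
        "torch": torch.__version__,              # PyTorch version string
        "cuda": torch.version.cuda,              # CUDA version PyTorch was built with (None for CPU builds)
        "gpu_count": torch.cuda.device_count(),  # number of GPUs visible to this process
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running the same snippet before and after changing the installed build makes it easy to confirm which binary is actually in use.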

terminate called after throwing an instance of 'c10::Error'
  what():  invalid device pointer: 0x7f3872000000 (free at /pytorch/c10/cuda/CUDACachingAllocator.cpp:349)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3e05796193 in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x19e68 (0x7f3e07c20e68 in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f3e0578663d in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x18b2909 (0x7f3d5e933909 in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x38b9a3d (0x7f3d6093aa3d in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x10fc (0x7f3d610921dc in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x4b2 (0x7f3d61093082 in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f3d6108c979 in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7f3e0cb8a91a in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xedef (0x7f3e0d7b1def in /home/WeicongChen/anaconda3/lib/python3.7/site-packages/torch/_C.cpython-37m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x76db (0x7f3e11ea76db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #11: clone + 0x3f (0x7f3e11bd088f in /lib/x86_64-linux-gnu/libc.so.6)

Could you install the latest nightly binary and rerun the code, please?

It works, thank you!

This really worked…