PyTorch - multi-threading seg fault when using async-TD3 structure

Hello

I’m currently trying to instantiate several threads to spawn actors for an async TD3 network structure, but after the threads have run for a while, a segmentation fault occurs and the gdb trace gives the following report:

terminate called after throwing an instance of 'c10::Error'
  what():  invalid device pointer: 0x7ffe813e8200
Exception raised from free at ../c10/cuda/CUDACachingAllocator.cpp:2058 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fffe2c884d7 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fffe2c5236b in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x22f0e (0x7fffe7427f0e in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4ccb76 (0x7fff90d6fb76 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: at::get_overlap_status(c10::TensorImpl*, c10::TensorImpl*) + 0x70c (0x7fff792b063c in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::assert_no_partial_overlap(at::TensorBase const&, at::TensorBase const&) + 0xf (0x7fff792b070f in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::TensorIteratorBase::compute_mem_overlaps(at::TensorIteratorConfig const&) + 0x110 (0x7fff792ed280 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x43 (0x7fff792f2913 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xb2 (0x7fff792f3f22 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7fff795c48be in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b7b847 (0x7fff544ab847 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x74 (0x7fff7a101e94 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4749f75 (0x7fff7c139f75 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x74 (0x7fff7a101e94 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4185613 (0x7fff7bb75613 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) + 0x13e (0x7fff7a13b9fe in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x49ec2c6 (0x7fff7c3dc2c6 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::AccumulateGrad::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x106 (0x7fff7c3ddd86 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x49e852b (0x7fff7c3d852b in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0xe8d (0x7fff7c3d193d in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x6b0 (0x7fff7c3d2cb0 in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x8b (0x7fff7c3c99eb in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4f (0x7fff90fd264f in /home/contractor/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #23: <unknown function> + 0xd6de4 (0x7fffef918de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #24: <unknown function> + 0x8609 (0x7ffff7d75609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #25: clone + 0x43 (0x7ffff7eaf133 in /lib/x86_64-linux-gnu/libc.so.6)

What might be the problem? The error says it occurs in the CUDACachingAllocator; is this because I’m running out of system RAM or GPU RAM?

Thanks!

You’re likely running out of GPU RAM.

Try adding some memory-usage logging to your code and see how high the usage gets close to the failure point.
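
For example, here is a minimal sketch of that kind of logging, using torch.cuda.memory_allocated / memory_reserved / max_memory_allocated (the log_gpu_memory helper and the tag names are just illustrative, not part of any library):

import torch

def log_gpu_memory(tag, device=0):
    # Illustrative helper: print the memory the CUDA caching allocator
    # currently holds for this process, plus the peak so far.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    peak = torch.cuda.max_memory_allocated(device) / 1024**2
    print(f"[{tag}] allocated={allocated:.1f} MiB "
          f"reserved={reserved:.1f} MiB peak={peak:.1f} MiB")

# Call it around the suspected failure point in each actor/learner loop, e.g.:
# log_gpu_memory("before update")
# loss.backward(); optimizer.step()
# log_gpu_memory("after update")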

Hi Rodrigo, thanks for the suggestion! I checked with nvidia-smi and it shows the GPU is not heavily used, because the model I use is pretty small (training one model usually takes about 200 MB of GPU RAM).

I’ve seen implementations that use a shared optimizer class, but I’m not sure whether that makes a difference. Maybe I’ll try that, thanks!

I was able to solve this problem by using a shared optimizer class, as implemented in Morvan Zhou’s A3C implementation on GitHub.
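
For reference, here is a minimal sketch of that kind of shared Adam optimizer, modeled on the pattern used in such A3C implementations (the class name SharedAdam and the exact state fields are illustrative, and details may vary across PyTorch versions): the per-parameter optimizer state is created eagerly and moved into shared memory so that multiple workers can update the same parameters.

import torch

class SharedAdam(torch.optim.Adam):
    # Adam whose per-parameter state lives in shared memory, so several
    # workers can update the same (shared) model parameters.
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0):
        super().__init__(params, lr=lr, betas=betas, eps=eps,
                         weight_decay=weight_decay)
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                # Create the state up front and move it to shared memory.
                state['step'] = torch.zeros(1)
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)
                state['step'].share_memory_()
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()

The model itself is put into shared memory with model.share_memory() before the workers are spawned, and every worker is handed this single optimizer instance; some implementations also override step() to handle the shared step counter explicitly.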