Deadlock when freeing memory

Hello,

I’m seeing a deadlock when deleting tensors from inside a c10::FreeMemoryCallback, whenever the callback ends up being executed on a thread other than the main thread.

I’m using REGISTER_FREE_MEMORY_CALLBACK to register a callback that calls the R garbage collector whenever LibTorch’s allocator needs more memory. When the callback is executed from the main thread everything works fine and memory is correctly released.
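
For reference, this is roughly how the callback is registered (a simplified sketch, not the exact code; the real call_r_gc lives in autograd.cpp and triggers R’s gc()):

#include <c10/cuda/CUDACachingAllocator.h>

// Defined in autograd.cpp in the real code: asks R to run gc(), which deletes
// the R-side tensor wrappers and thereby releases their CUDA storage.
void call_r_gc(bool full);

namespace c10 {

class GarbageCollectorCallback : public FreeMemoryCallback {
 public:
  bool Execute() override {
    call_r_gc(true);
    return true;  // report that memory may have been freed so the allocator retries
  }
};

// Lets the CUDA caching allocator invoke Execute() when an allocation fails.
REGISTER_FREE_MEMORY_CALLBACK(garbage_collector_callback, GarbageCollectorCallback);

}  // namespace c10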

However, in some situations we call LibTorch from a different thread, because we need the main thread to stay free to execute arbitrary R functions (e.g. backward hooks). When the callback fires on that worker thread, we still call the R GC from the main thread, which in turn starts deleting the allocated tensors from the main thread.
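
To make the threading setup concrete, here is a heavily simplified sketch of the pattern (not the actual code; run_on_main_thread and the task queue are made-up names for illustration only):

#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical task queue serviced by the main (R) thread.
std::mutex mtx;
std::condition_variable cv;
std::deque<std::packaged_task<void()>> tasks;

// Called from the worker thread (e.g. from inside the free-memory callback):
// posts work to the main thread and blocks until it has finished.
void run_on_main_thread(std::function<void()> fn) {
  std::packaged_task<void()> task(std::move(fn));
  auto done = task.get_future();
  {
    std::lock_guard<std::mutex> lk(mtx);
    tasks.push_back(std::move(task));
  }
  cv.notify_one();
  done.wait();  // the worker thread waits here for the main thread
}

int main() {
  // The worker thread runs the LibTorch computation so the main thread stays
  // free to execute arbitrary R code (e.g. backward hooks).
  std::thread worker([] {
    // ... LibTorch ops; when the allocator runs out of memory it fires the
    // registered FreeMemoryCallback on *this* thread, which then calls:
    run_on_main_thread([] { /* R gc(): deletes tensors on the main thread */ });
  });

  // Main thread: loop that executes the posted R tasks.
  for (;;) {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return !tasks.empty(); });
    auto task = std::move(tasks.front());
    tasks.pop_front();
    lk.unlock();
    task();  // here the R gc runs and tensor deletion happens on the main thread
    break;   // one iteration is enough for this sketch
  }
  worker.join();
}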

This is causing a deadlock with a backtrace like this in the main thread:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff51620f4 in __GI___pthread_mutex_lock (mutex=0x555561b739a0) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007fffe6b1de13 in c10::cuda::CUDACachingAllocator::raw_delete(void*) ()
   from /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so
#3  0x00007fffe92e3a4c in c10::TensorImpl::release_resources() ()
   from /home/dfalbel/torch/lantern/build/libtorch/lib/libc10.so
#4  0x00007fffe751c64d in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() ()
   from /home/dfalbel/torch/inst/lib/liblantern.so
#5  0x00007fffe7518c46 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr() ()
   from /home/dfalbel/torch/inst/lib/liblantern.so

And here is the backtrace of the thread that called the R GC:

#8  0x00007fffe7eaf7e3 in call_r_gc(bool) () at autograd.cpp:408
#9  0x00007fffe763624f in c10::GarbageCollectorCallback::Execute() () from /home/dfalbel/torch/inst/lib/liblantern.so
#10 0x00007fffe6b3948b in c10::cuda::CUDACachingAllocator::DeviceCachingAllocator::malloc(int, unsigned long, CUstream_st*) () from /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so
#11 0x00007fffe6b3acdc in c10::cuda::CUDACachingAllocator::THCCachingAllocator::malloc(void**, int, unsigned long, CUstr

I know this is not a very detailed report and that it’s hard to answer precisely, but any hint about what could be causing the lock would be appreciated.

Thank you very much!