Multithread deadlock

(Pascal Soveaux) #1

Hello,

I am using libtorch v1.0.1 with CUDA 10.1 on a 2080. glibc is 2.28.

I read that libtorch is thread-safe, but I cannot make it work correctly in a multithreaded C++ app. I have two trainers, each in its own thread, with no global, static, or shared variables… but it always ends in a deadlock.

When deadlocked, both threads are near cuLaunchKernel. Is there something I need to do? Do I have to initialize a context or something?

I also tested with master and a nightly of libtorch but it ends the same.

Thanks,

Pascal

(Pascal Soveaux) #2

Hello,

Great! It seems that PoolWindow makes it work correctly.

I ran into a problem where available_handles (a global) was destroyed before PoolWindow (a thread_local), according to Valgrind.

available_handles[d_h.first].push_back(d_h.second); // SIGSEGV

I don't know how this can happen; I will try to reproduce it.

Pascal

(Will Feng) #3

@pascal.soveaux https://github.com/pytorch/pytorch/issues/19394 is likely related. Would you like to comment on the issue about your findings? Thanks!

(Pascal Soveaux) #4

@yf225 Sorry for the late response. I found that this problem is not related to PoolWindow at all. The thread exits after the “main” (libtorch) thread, because both are in fact worker threads.
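
For readers hitting the same crash, here is a minimal sketch of the ordering issue using only the C++ standard library, not libtorch. HandlePool, Window, pool() and run_worker_and_join() are hypothetical stand-ins for illustration and are not the actual PyTorch types. The assumption is that a thread_local destructor which pushes into a global is only safe while that global is still alive, so the worker is joined before the function (and, in a real program, main) returns.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the global handle pool (available_handles above).
struct HandlePool {
    std::mutex mutex;
    std::vector<int> handles;
};

// Function-local static: destroyed during static destruction, after main returns.
HandlePool& pool() {
    static HandlePool instance;
    return instance;
}

// Hypothetical stand-in for PoolWindow: holds a handle for the current thread
// and gives it back to the global pool when the owning thread exits.
struct Window {
    int handle = -1;
    void reserve(int h) { handle = h; }
    ~Window() {
        if (handle < 0) return;  // nothing was reserved on this thread
        std::lock_guard<std::mutex> lock(pool().mutex);
        pool().handles.push_back(handle);  // only valid while the pool is alive
    }
};

thread_local Window window;

// Starts one worker that reserves handle h and joins it before returning, so
// the worker's thread_local destructor runs while the pool still exists.
// Returns the number of handles that have been given back to the pool.
std::size_t run_worker_and_join(int h) {
    pool();  // make sure the pool is constructed before any handle is returned
    std::thread worker([h] { window.reserve(h); });
    worker.join();  // the worker's thread_local objects are destroyed by now
    std::lock_guard<std::mutex> lock(pool().mutex);
    return pool().handles.size();
}
```

Calling run_worker_and_join(7) from main should leave one handle in the pool. If the worker were instead detached, or outlived main as in the case described above, its Window destructor could run after the pool had already been destroyed, which would match the SIGSEGV on push_back.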