Multithreading Segmentation fault

I’m trying to implement multi-threaded, single-GPU Actor-Critic training, where one thread runs the simulation and another thread does the training.
The two threads will share a replay buffer (on CPU) and a model (on GPU).

An example code is here in the gist

If you run the code, it will occasionally give a segmentation fault. In real tasks, the segmentation fault happens much faster.

When I use the net_lock to lock the model, i.e., only one thread can use the model at a time, the problem goes away. However, this is obviously less efficient. My hope is that both threads can use the model at the same time, and I’m not worried about nondeterministic results for now.
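For readers without access to the gist, here is a minimal sketch of the locking pattern described above, not the original code. Only the name net_lock comes from the post; TinyModel, buffer_lock, simulate, train and run_demo are hypothetical names, and the model is a pure-Python stand-in so the sketch runs without PyTorch or a GPU. It shows the workaround only and does not reproduce the segfault.

```python
import random
import threading
import time
from collections import deque


class TinyModel:
    """Hypothetical stand-in for the shared network (the real one lives on the GPU)."""

    def __init__(self):
        self.weight = 0.0

    def forward(self, x):
        return self.weight * x

    def update(self, x, target, lr=0.1):
        # One gradient step on the squared error (weight * x - target) ** 2.
        grad = 2.0 * (self.weight * x - target) * x
        self.weight -= lr * grad


def run_demo(sim_steps=1000, train_steps=500):
    model = TinyModel()
    replay_buffer = deque(maxlen=1000)
    net_lock = threading.Lock()      # guards every use of the model
    buffer_lock = threading.Lock()   # guards the shared replay buffer

    def simulate():
        rng = random.Random(0)
        for _ in range(sim_steps):
            x = rng.uniform(-1.0, 1.0)
            with net_lock:           # the actor queries the model under the lock
                model.forward(x)
            with buffer_lock:
                replay_buffer.append((x, 3.0 * x))  # toy target: y = 3x

    def train():
        rng = random.Random(1)
        done = 0
        while done < train_steps:
            with buffer_lock:
                sample = rng.choice(replay_buffer) if replay_buffer else None
            if sample is None:
                time.sleep(0.001)    # wait for the simulator to produce data
                continue
            with net_lock:           # the update runs under the same lock
                model.update(*sample)
            done += 1

    threads = [threading.Thread(target=simulate), threading.Thread(target=train)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model, replay_buffer
```

Because both the forward pass and the update take net_lock, the two threads never touch the model at the same time, which is exactly why this version is slower than the lock-free one I would like to use.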

Any idea why this is happening?


Some update:

It seems the problem is that the forward and backward computation of PyTorch is not thread-safe. I’m not sure whether this problem is introduced by autograd. Can anyone confirm whether PyTorch is thread-safe or not?

It should be thread-safe, and if it’s not, we will fix it.

I’ve tested your example against the master branch of pytorch, and it did not produce segfaults.
I wonder if we fixed the issue you are seeing already.
Can you give the master branch a try to confirm?

Hi Soumith, I just saw your reply and went to try my earlier script.

I ran two versions of the actual training code, one with a thread lock on the model and the other without the lock. The one with the lock is still running now (>1 hour), while the one without the lock gave a segmentation fault sooner or later (I tried more than once).

I also tried the script I put in the gist and I also got the error.
To reproduce the error, I usually run the script about 20 times in a loop: for i in `seq 0 20`; do python; done
(It’s better to put the shell script in a .sh file in case you wanna kill it).
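The same repeat-until-it-crashes loop can be sketched in Python, which also counts how many runs actually segfaulted. This is an illustration under assumptions: count_segfaults is a hypothetical helper, it needs Python 3.5+ for subprocess.run (the thread itself uses Python 2.7), and the negative return code check only applies on POSIX systems.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=20):
    """Run `cmd` (a list of arguments) `runs` times; count runs killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a return code of -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

For example, count_segfaults([sys.executable, "your_script.py"], runs=20) would run a script 20 times, where your_script.py is a placeholder for the gist script.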

BTW, it happens on both GPU and CPU (make the network smaller so that the 20 runs finish faster).

For your information, my environment is python2.7 with cuda 8.0.

@Zihang_Dai but what is your print(torch.__version__)?

I installed from the master branch and version is 0.1.12+1572173

I could reproduce this, and I opened an issue here: Multithreading Segmentation fault
please follow along, and thanks a lot for the bug report.