Multithreading Segmentation fault

Zihang_Dai · May 31, 2017, 6:22pm

I’m trying to implement multi-threaded, single-GPU Actor-Critic training, where one thread runs the simulation and another thread runs the training.
The two threads share a replay buffer (on the CPU) and a model (on the GPU).

Example code is in this gist: https://gist.github.com/zihangdai/fc8f76fbb8a0f6323a6b31e6d98ceb50

If you run the code, it will occasionally give a segmentation fault. In real tasks, the segmentation fault happens much faster.

When I use the net_lock to lock the model, i.e., only one thread can use the model at a time, the problem goes away. However, this is obviously less efficient. My hope is that both threads can use the model at the same time; I’m not worried about the resulting nondeterministic behavior for now.
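For readers without the gist at hand, the locking workaround described above can be sketched roughly as follows. This is a minimal illustration, not the gist's code: `TinyModel`, `simulate`, `train`, and `run` are hypothetical stand-ins, and a plain Python object replaces the real PyTorch network. Only the name `net_lock` comes from the post.

```python
import threading


class TinyModel:
    """Hypothetical stand-in for the shared network (not the gist's model)."""

    def __init__(self):
        self.weight = 0.0
        self.updates = 0

    def forward(self, x):
        return self.weight * x

    def train_step(self, lr=0.1):
        self.weight += lr
        self.updates += 1


def simulate(model, net_lock, replay_buffer, steps):
    """Actor thread: query the model under the lock, then store a transition."""
    for step in range(steps):
        with net_lock:  # only one thread may touch the model at a time
            action = model.forward(1.0)
        # Only this thread writes to the buffer in this sketch.
        replay_buffer.append((step, action))


def train(model, net_lock, steps):
    """Learner thread: update the model under the same lock."""
    for _ in range(steps):
        with net_lock:
            model.train_step()


def run(steps=1000):
    model = TinyModel()
    net_lock = threading.Lock()
    replay_buffer = []
    actor = threading.Thread(target=simulate, args=(model, net_lock, replay_buffer, steps))
    learner = threading.Thread(target=train, args=(model, net_lock, steps))
    actor.start()
    learner.start()
    actor.join()
    learner.join()
    return model.updates, len(replay_buffer)
```

The lock serializes every forward and training step, which is exactly the efficiency cost mentioned above: the two threads never use the model concurrently.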

Any idea why this is happening?

Zihang_Dai · June 3, 2017, 12:11am

An update:

It seems the problem is that the forward and backward computation in PyTorch is not thread-safe. I’m not sure whether this problem is introduced by Autograd. Can anyone confirm whether PyTorch is thread-safe or not?

smth · June 16, 2017, 1:37pm

it should be thread-safe, and if it’s not, we will fix it.

I’ve tested your example against the master branch of pytorch, and it did not produce segfaults.
I wonder if we fixed the issue you are seeing already.
Can you give the master branch a try to confirm?

Zihang_Dai · June 21, 2017, 6:12am

Hi Soumith, I just saw your reply and went to try my earlier script.

I ran two versions of the actual training code, one with a thread lock on the model and the other without the lock. The one with the lock is still running now (>1 hour), while the one without the lock gave a segmentation fault sooner or later every time (I tried more than once).

I also tried the script I put in the gist, and I got the error there as well.
To reproduce the error, I usually run the script about 20 times with: for i in `seq 0 20`; do python multithread.py; done
(It’s better to put the shell loop in a .sh file in case you want to kill it.)
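The repeat-until-it-crashes loop can also be driven from Python, which makes it easy to count how many runs actually segfault. The sketch below is an illustration only: `count_segfaults` is a hypothetical helper, it assumes Python 3 on a POSIX system (where a child killed by signal N reports a return code of -N), and it is not part of the original script.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=20):
    """Run `cmd` `runs` times and count the runs that were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(cmd)
        # On POSIX, a negative return code -N means the child died from signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Assuming the gist is saved as multithread.py, a call such as `count_segfaults([sys.executable, "multithread.py"], runs=20)` would report how many of the 20 runs crashed.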

BTW, it happens on both GPU and CPU (you can make the network smaller so that the 20 runs finish faster).

For your information, my environment is Python 2.7 with CUDA 8.0.

smth · June 21, 2017, 2:05pm

@Zihang_Dai but what is your print(torch.__version__)?

Zihang_Dai · June 21, 2017, 4:50pm

I installed from the master branch, and the version is 0.1.12+1572173

smth · June 21, 2017, 6:37pm

i could reproduce this, and I opened an issue here: Multithreading Segmentation fault
please follow along, and thanks a lot for the bug report.