PyTorch multiprocessing queues segmentation fault

I am using PyTorch multiprocessing queues to exchange data between subprocesses and the parent process. The subprocesses sample data from a simulation environment, and that data is then used to train a network. After many sampling iterations that execute correctly, I get a segmentation fault while accessing the queue, with no Python traceback. I really don’t know what is happening. Here is the part of the code related to the queues and the subprocesses:

        # sample the batch in 6 chunks so that less data sits in the out queue at once
        batch = []
        for i in range(6):
            batch_range = (batch_size // 6 * i, batch_size // 6 * (i + 1))
            batch += drain(env, gamma, policy, horizon, render, speedup, seed, batch_range, in_queue, out_queue)

def drain(env, gamma, policy, horizon, render, speedup, seed, batch_range, in_queue, out_queue):
    # enqueue one sampling job per trajectory in this chunk
    for i in range(batch_range[0], batch_range[1]):
        in_queue.put((env, gamma, policy, horizon, render, speedup, seed * 1000 + i))

    # collect the results produced by the worker processes
    batch = []
    for i in range(batch_range[0], batch_range[1]):
        batch.append(out_queue.get())

    return batch

I tried to split the processed data into chunks because I thought the amount of data exchanged over the out queue could be an issue.
Some details:

  • The queues are torch.multiprocessing.Queue
  • The policy is a class containing a PyTorch network that is used to select actions in the simulation environment (the worker side is sketched below)
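
For context, the worker side (not shown above) roughly follows this pattern. This is only a simplified sketch: sample_trajectory is a placeholder for the actual rollout code that runs the policy on the environment, and the number of worker processes is illustrative.

import torch.multiprocessing as mp


# Placeholder for the real rollout code, which runs the policy network on the
# simulation environment for up to `horizon` steps and returns one trajectory.
def sample_trajectory(env, gamma, policy, horizon, render, speedup, seed):
    return {"seed": seed, "observations": [], "actions": [], "rewards": []}


def worker(in_queue, out_queue):
    # Each worker repeatedly takes one sampling job from in_queue and puts the
    # resulting trajectory on out_queue, until it receives a None sentinel.
    while True:
        job = in_queue.get()
        if job is None:
            break
        out_queue.put(sample_trajectory(*job))


if __name__ == "__main__":
    in_queue, out_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_queue, out_queue)) for _ in range(6)]
    for w in workers:
        w.start()
    # ... the sampling code above puts jobs on in_queue and reads results from out_queue ...
    for _ in workers:
        in_queue.put(None)  # shut the workers down
    for w in workers:
        w.join()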

I managed to get a backtrace of the crash using GDB:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe1800a73 in THDoubleStorage_get ()
   from /home/user/miniconda3/envs/MyFolder/lib/python3.5/site-packages/torch/lib/libcaffe2.so
(gdb) backtrace
#0  0x00007fffe1800a73 in THDoubleStorage_get ()
   from /home/user/miniconda3/envs/MyFolder/lib/python3.5/site-packages/torch/lib/libcaffe2.so
#1  0x00007fffe169cea7 in at::CPUDoubleTensor::localScalar() ()
   from /home/user/miniconda3/envs/MyFolder/lib/python3.5/site-packages/torch/lib/libcaffe2.so
#2  0x00007fffe33ede95 in torch::autograd::Variable::Impl::localScalar (this=<optimized out>)
    at torch/csrc/autograd/variable.cpp:76
#3  0x00007fffe3704494 in at::Tensor::toCDouble (this=0x7fff4d6d6250)
    at /opt/conda/conda-bld/pytorch_1532499195775/work/torch/lib/tmp_install/include/ATen/TensorMethods.h:1432
#4  torch::autograd::dispatch_to_CDouble (self=...)
    at torch/csrc/autograd/generated/python_variable_methods.cpp:250
#5  0x00007fffe37050ce in torch::autograd::THPVariable_item (self=0x7fff4d6d6240, 
    args=<optimized out>) at torch/csrc/autograd/generated/python_variable_methods.cpp:453
#6  0x00005555556ff4a4 in PyEval_EvalFrameEx ()
#7  0x00005555556ff351 in PyEval_EvalFrameEx ()
#8  0x00005555556ff351 in PyEval_EvalFrameEx ()
#9  0x00005555556fac20 in PyEval_EvalFrameEx ()
#10 0x00005555557052ad in PyEval_EvalCodeEx ()
#11 0x00005555557061fc in PyEval_EvalCode ()
#12 0x00005555557638d4 in run_mod ()
#13 0x0000555555764f41 in PyRun_FileExFlags ()
#14 0x000055555576515e in PyRun_SimpleFileExFlags ()
#15 0x000055555576580d in Py_Main ()
#16 0x000055555562f571 in main ()
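
If I am reading the backtrace correctly, the crash happens inside a tensor .item() call on a CPU double tensor (THPVariable_item -> dispatch_to_CDouble -> THDoubleStorage_get), i.e. a call of this kind somewhere in my code (the exact call site is not visible in the trace):

import torch

x = torch.zeros(1, dtype=torch.float64)  # a CPU double tensor, like the CPUDoubleTensor in the trace
value = x.item()                         # THPVariable_item is the native entry point for .item()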

Does anyone know what is going on here?