Deadlock during backward

While trying to run a program on an RHEL cluster, the code gets stuck at the backward step. The PyTorch installation on the RHEL cluster is a GPU build. The problem does not occur on my PC, which has a CPU-only build.

Any chance it’s related to: https://github.com/pytorch/pytorch/pull/1243? Anyway, how do I go about fixing it?
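
In the meantime, here is a minimal sketch (assuming Python 3.3+, which ships the standard-library faulthandler module) of how I can also dump the Python-level stacks of the hung process, to complement the native trace below:

import faulthandler
import sys

# Periodically dump every Python thread's stack to stderr; if
# backward() hangs, the last dump shows where each thread is stuck
# at the Python level.
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)

# ... rest of the training script runs as usual ...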

Below is the gdb backtrace:

(gdb) backtrace
#0 pthread_cond_wait@@GLIBC_2.3.2 () at …/nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007fffc7ac3a26 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/x86_64-conda_cos6-linux-gnu/bits/gthr-default.h:877
#2 0x00007fffe8886573 in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/condition_variable:98
#3 0x00007fffe88b1d7c in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
at torch/csrc/autograd/python_engine.cpp:58
#4 0x00007fffe88b29ae in THPEngine_run_backward(THPEngine*, _object*, _object*) ()
at torch/csrc/autograd/python_engine.cpp:166
#5 0x00007ffff7e4ffd4 in _PyCFunction_FastCallDict ()
#6 0x00007ffff7e7df24 in _PyCFunction_FastCallKeywords ()
#7 0x00007ffff7eddbec in call_function ()
#8 0x00007ffff7f02eb1 in _PyEval_EvalFrameDefault ()
#9 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#10 0x00007ffff7ed7a11 in fast_function ()
#11 0x00007ffff7eddcc5 in call_function ()
#12 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#13 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#14 0x00007ffff7ed7a11 in fast_function ()
#15 0x00007ffff7eddcc5 in call_function ()
#16 0x00007ffff7f02eb1 in _PyEval_EvalFrameDefault ()
#17 0x00007ffff7ed8529 in PyEval_EvalCodeEx ()
#18 0x00007ffff7ed92cc in PyEval_EvalCode ()
#19 0x00007ffff7effa8d in builtin_exec ()
#20 0x00007ffff7e53051 in PyCFunction_Call ()
#21 0x00007ffff7f0742b in _PyEval_EvalFrameDefault ()
#22 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#23 0x00007ffff7ed7a11 in fast_function ()
#24 0x00007ffff7eddcc5 in call_function ()
#25 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#26 0x00007ffff7ed77db in fast_function ()
#27 0x00007ffff7eddcc5 in call_function ()
#28 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#29 0x00007ffff7ed77db in fast_function ()
#30 0x00007ffff7eddcc5 in call_function ()
#31 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#32 0x00007ffff7ed77db in fast_function ()
#33 0x00007ffff7eddcc5 in call_function ()
#34 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#35 0x00007ffff7ed7e4b in _PyFunction_FastCallDict ()
#36 0x00007ffff7e5039f in _PyObject_FastCallDict ()
#37 0x00007ffff7e93650 in _PyObject_CallMethodIdObjArgs ()
#38 0x00007ffff7e46ea0 in PyImport_ImportModuleLevelObject ()
#39 0x00007ffff7f04b2c in _PyEval_EvalFrameDefault ()
#40 0x00007ffff7ed8529 in PyEval_EvalCodeEx ()
#41 0x00007ffff7ed92cc in PyEval_EvalCode ()
#42 0x00007ffff7f55af4 in run_mod ()
#43 0x00007ffff7e1c930 in PyRun_InteractiveOneObjectEx ()
#44 0x00007ffff7e1cae6 in PyRun_InteractiveLoopFlags ()
#45 0x00007ffff7e1cb86 in PyRun_AnyFileExFlags.cold.2769 ()
#46 0x00007ffff7e1eb69 in Py_Main.cold.2794 ()
#47 0x00007ffff7e2171e in main ()

@apaszke - I wonder if you have any ideas since you seem to have taken a look at a similar case before: https://discuss.pytorch.org/t/solved-archlinux-using-variable-backwards-appears-to-hang-program-indefinitely/1675/11

What’s the PyTorch version?

It’s the latest version - 0.4.0

Could you post a minimal example? Thanks!

Thanks, @SimonW. Below is an MWE that is much simpler than what I had posted earlier. It turns out that the deadlock occurs only when N=1; values greater than 1 don’t seem to trigger it.

import torch
from torch.autograd import Variable

N = 1   # N = 1 CAUSES DEADLOCK, N > 1 DOESN'T
D = 2

x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)

w1 = Variable(torch.randn(D, D), requires_grad=True)

y_pred = x.mm(w1)

loss = (y_pred - y).pow(2).sum()

learning_rate = 1e-6
print('Check1')
loss.backward(retain_graph=True)   # hangs here on the cluster
print('Check2')

# manual SGD step on the weights
w1.data -= learning_rate * w1.grad.data

w1.grad.data.zero_()
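
In case it’s useful, here is a small snippet (only standard PyTorch attributes, nothing cluster-specific) that I can run alongside the MWE to report the build and threading setup on the cluster:

import torch

# Report which build is in use and how many CPU threads it may spawn.
print(torch.__version__)           # PyTorch version
print(torch.version.cuda)          # CUDA version the binary was built against
print(torch.cuda.is_available())   # whether a GPU is visible
print(torch.get_num_threads())     # intra-op CPU thread count
print(torch.__file__)              # which installation is being imported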

@Santosh_Manicka could you try building from source and seeing if the problem still persists? I can’t reproduce this on the master branch.

Thank you, @richard, will try that. Are you able to reproduce it on the binary version, though (if you happen to have it)?

I tried it with 0.4.0 (installed via conda) and it works without a deadlock.
(I removed the Variable part)
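
For reference, dropping the Variable wrappers just means something like this (a sketch of the 0.4-style equivalent, not a verbatim copy of what I ran; in 0.4 tensors carry requires_grad directly):

import torch

N = 1   # value from the posted example
D = 2

x = torch.randn(N, D)                        # requires_grad defaults to False
y = torch.randn(N, D)
w1 = torch.randn(D, D, requires_grad=True)

y_pred = x.mm(w1)
loss = (y_pred - y).pow(2).sum()

loss.backward(retain_graph=True)
print(w1.grad)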

Thank you, @ptrblck. I tried removing the Variable part for x and y, but it still deadlocks on my end. My installation is on an RHEL 6.9 (Santiago) cluster, and it’s a GPU build. As I mentioned earlier, the code works fine on my Windows PC, which has a CPU-only build.

I’m sorry @ptrblck, actually I should have set N=1 in the code, as that’s what causes the deadlock. Could you please try it again?

I’m sorry @richard, actually I should have set N=1 in the code, as that’s what causes the deadlock. Could you please try it again?

I tried it with N=1 and N=2 and both worked.
However, I can test it on another machine later and see if it deadlocks there.

I tried it with N=1 on PyTorch 0.4 and on master (on two different machines).

Hi @ptrblck, would it be possible to post your sys.path here? Checking it has helped me before with a segfault issue, when I found that the code was picking up numpy from a different installation than Anaconda’s. Thanks.

Sure! Here it is:

['',
 '/home/ptrblck/anaconda2/lib/python27.zip',
 '/home/ptrblck/anaconda2/lib/python2.7',
 '/home/ptrblck/anaconda2/lib/python2.7/plat-linux2',
 '/home/ptrblck/anaconda2/lib/python2.7/lib-tk',
 '/home/ptrblck/anaconda2/lib/python2.7/lib-old',
 '/home/ptrblck/anaconda2/lib/python2.7/lib-dynload',
 '/home/ptrblck/.local/lib/python2.7/site-packages',
 '/home/ptrblck/anaconda2/lib/python2.7/site-packages',
 '/home/ptrblck/anaconda2/lib/python2.7/site-packages/Sphinx-1.5.1-py2.7.egg',
 '/home/ptrblck/anaconda2/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg',
 '/home/ptrblck/anaconda2/lib/python2.7/site-packages/IPython/extensions',
 '/home/ptrblck/.ipython']

I tested it on another machine and it works like in @richard’s case.

Thanks! Mine is:

['',
 '/cluster/tufts/software/anaconda3/lib/python36.zip',
 '/cluster/tufts/software/anaconda3/lib/python3.6',
 '/cluster/tufts/software/anaconda3/lib/python3.6/lib-dynload',
 '/cluster/home/smanic02/.local/lib/python3.6/site-packages',
 '/cluster/tufts/software/anaconda3/lib/python3.6/site-packages',
 '/cluster/tufts/software/anaconda3/lib/python3.6/site-packages/Mako-1.0.7-py3.6.egg']

Could anaconda3 and python3.6 on my end make a difference? I wonder.
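
To double-check which installations the interpreter actually picks up (a quick sanity check on my side, nothing PyTorch-specific), I’m running something like:

import sys
import numpy
import torch

# Confirm which binaries are actually imported; a mismatch (e.g.
# numpy coming from ~/.local instead of the anaconda3 tree) has
# bitten me before with a segfault.
print(sys.executable)
print(numpy.__file__)
print(torch.__file__)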

I tested it in my Python 3 environment with PyTorch 0.4.0 and it also works.
Here is the sys.path of this env, in case you would like to see it:

['',
 '/home/ptrblck/anaconda2/envs/py3/lib/python36.zip',
 '/home/ptrblck/anaconda2/envs/py3/lib/python3.6',
 '/home/ptrblck/anaconda2/envs/py3/lib/python3.6/lib-dynload',
 '/home/ptrblck/anaconda2/envs/py3/lib/python3.6/site-packages',
 '/home/ptrblck/anaconda2/envs/py3/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg']

Thanks! I tried it on anaconda/2 python2.7, which our cluster fortunately had, but it still runs into a deadlock.

It might be environment-dependent. It might be best to contact the cluster administrator.

Thanks, @SimonW. I have contacted them, but the cause of the problem is not apparent to them either. The puzzle is why N = 2, 3, … work but not N = 1. Could it be that for N > 1 some kind of multiprocessing kicks in, even though I didn’t ask for it?
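
To rule out CPU-side threading (just a guess on my part, not a confirmed cause), I’m going to retry the MWE pinned to a single intra-op thread:

import torch

# Force single-threaded CPU execution before building the graph, to
# check whether the hang is related to worker threads.
torch.set_num_threads(1)

N, D = 1, 2
x = torch.randn(N, D)
y = torch.randn(N, D)
w1 = torch.randn(D, D, requires_grad=True)

loss = (x.mm(w1) - y).pow(2).sum()
print('Check1')
loss.backward()
print('Check2')

Setting OMP_NUM_THREADS=1 in the environment before launching Python would be another way to test the same thing.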