While running a program on an RHEL cluster, the code gets stuck at the backward step. The PyTorch on the cluster is a GPU build. The problem does not occur on my PC, which runs a CPU-only build.
Could this be related to https://github.com/pytorch/pytorch/pull/1243 ? In any case, how do I go about fixing it?
Below is the gdb backtrace:
(gdb) backtrace
#0 pthread_cond_wait@@GLIBC_2.3.2 () at …/nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007fffc7ac3a26 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/x86_64-conda_cos6-linux-gnu/bits/gthr-default.h:877
#2 0x00007fffe8886573 in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/condition_variable:98
#3 0x00007fffe88b1d7c in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
at torch/csrc/autograd/python_engine.cpp:58
#4 0x00007fffe88b29ae in THPEngine_run_backward(THPEngine*, _object*, _object*) ()
at torch/csrc/autograd/python_engine.cpp:166
#5 0x00007ffff7e4ffd4 in _PyCFunction_FastCallDict ()
#6 0x00007ffff7e7df24 in _PyCFunction_FastCallKeywords ()
#7 0x00007ffff7eddbec in call_function ()
#8 0x00007ffff7f02eb1 in _PyEval_EvalFrameDefault ()
#9 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#10 0x00007ffff7ed7a11 in fast_function ()
#11 0x00007ffff7eddcc5 in call_function ()
#12 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#13 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#14 0x00007ffff7ed7a11 in fast_function ()
#15 0x00007ffff7eddcc5 in call_function ()
#16 0x00007ffff7f02eb1 in _PyEval_EvalFrameDefault ()
#17 0x00007ffff7ed8529 in PyEval_EvalCodeEx ()
#18 0x00007ffff7ed92cc in PyEval_EvalCode ()
#19 0x00007ffff7effa8d in builtin_exec ()
#20 0x00007ffff7e53051 in PyCFunction_Call ()
#21 0x00007ffff7f0742b in _PyEval_EvalFrameDefault ()
#22 0x00007ffff7ed69a6 in _PyEval_EvalCodeWithName ()
#23 0x00007ffff7ed7a11 in fast_function ()
#24 0x00007ffff7eddcc5 in call_function ()
#25 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#26 0x00007ffff7ed77db in fast_function ()
#27 0x00007ffff7eddcc5 in call_function ()
---Type <return> to continue, or q to quit---
#28 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#29 0x00007ffff7ed77db in fast_function ()
#30 0x00007ffff7eddcc5 in call_function ()
#31 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#32 0x00007ffff7ed77db in fast_function ()
#33 0x00007ffff7eddcc5 in call_function ()
#34 0x00007ffff7f0219a in _PyEval_EvalFrameDefault ()
#35 0x00007ffff7ed7e4b in _PyFunction_FastCallDict ()
#36 0x00007ffff7e5039f in _PyObject_FastCallDict ()
#37 0x00007ffff7e93650 in _PyObject_CallMethodIdObjArgs ()
#38 0x00007ffff7e46ea0 in PyImport_ImportModuleLevelObject ()
#39 0x00007ffff7f04b2c in _PyEval_EvalFrameDefault ()
#40 0x00007ffff7ed8529 in PyEval_EvalCodeEx ()
#41 0x00007ffff7ed92cc in PyEval_EvalCode ()
#42 0x00007ffff7f55af4 in run_mod ()
#43 0x00007ffff7e1c930 in PyRun_InteractiveOneObjectEx ()
#44 0x00007ffff7e1cae6 in PyRun_InteractiveLoopFlags ()
#45 0x00007ffff7e1cb86 in PyRun_AnyFileExFlags.cold.2769 ()
#46 0x00007ffff7e1eb69 in Py_Main.cold.2794 ()
#47 0x00007ffff7e2171e in main ()
@apaszke - I wonder if you have any ideas since you seem to have taken a look at a similar case before: https://discuss.pytorch.org/t/solved-archlinux-using-variable-backwards-appears-to-hang-program-indefinitely/1675/11
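(For completeness, a Python-level view of the same hang can also be captured with the standard faulthandler module; this is just a sketch of that approach, not what produced the trace above.)

import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1,
# e.g. by running `kill -USR1 <pid>` from another shell while the script hangs.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump all stacks automatically if the script is still running
# after 60 seconds (useful when backward() never returns).
faulthandler.dump_traceback_later(60, exit=False)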
SimonW (Simon Wang), May 29, 2018, 9:53pm
What's the PyTorch version?
It's the latest version, 0.4.0.
SimonW (Simon Wang), May 30, 2018, 4:26am
Could you post a minimal example? Thanks!
Thanks, @SimonW. Below is an MWE that is much simpler than what I had posted earlier. It turns out that the deadlock occurs only when N=1; values greater than 1 don't seem to trigger it.
import torch
from torch.autograd import Variable
N = 1 # N = 1 CAUSES DEADLOCK, N > 1 DOESN'T
D = 2
x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)
w1 = Variable(torch.randn(D,D), requires_grad=True)
y_pred = x.mm(w1)
loss = (y_pred - y).pow(2).sum()
learning_rate = 1e-6
print('Check1')
loss.backward(retain_graph=True)
print('Check2')
w1.data -= learning_rate * w1.grad.data  # manual SGD step
w1.grad.data.zero_()  # reset gradients for the next iteration
@Santosh_Manicka, could you try building from source and see if the problem persists? I can't reproduce this on the master branch.
Thank you, @richard, I will try that. Are you able to reproduce it on the binary version, though (if by any chance you have it)?
I tried it with 0.4.0 (installed via conda) and it works without a deadlock. (I removed the Variable part.)
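For reference, "removing the Variable part" means roughly this (a sketch; in 0.4 Variable and Tensor are merged, so plain tensors work directly):

import torch

N = 1  # the value reported to deadlock; N > 1 reportedly does not
D = 2

# In 0.4 plain tensors carry requires_grad, so no Variable wrapper is needed.
x = torch.randn(N, D)
y = torch.randn(N, D)
w1 = torch.randn(D, D, requires_grad=True)

y_pred = x.mm(w1)
loss = (y_pred - y).pow(2).sum()

learning_rate = 1e-6
print('Check1')
loss.backward(retain_graph=True)
print('Check2')

# Same manual update as in the original script.
w1.data -= learning_rate * w1.grad.data
w1.grad.data.zero_()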
Thank you, @ptrblck. I tried removing the Variable part for x and y, but it still deadlocks on my end. My installation is on an RHEL 6.9 (Santiago) cluster, and it's a GPU build. Also, as I mentioned earlier, the code works fine on my Windows PC, which has a CPU-only build.
I'm sorry, @ptrblck, I actually should have set N=1 in the code, as that's what causes the deadlock. Could you please try it again?
I'm sorry, @richard, I actually should have set N=1 in the code, as that's what causes the deadlock. Could you please try it again?
I tried it with N=1 and N=2 and both worked. However, I can test it on another machine later and see if it gets a deadlock.
I tried with N=1 on PyTorch 0.4 and on master (on two different machines).
Hi @ptrblck, would it be possible to post your sys.path here? It helped me before with a segfault issue, when I found that the code was picking up numpy from a different installation than Anaconda. Thanks.
Sure! Here it is:
['',
'/home/ptrblck/anaconda2/lib/python27.zip',
'/home/ptrblck/anaconda2/lib/python2.7',
'/home/ptrblck/anaconda2/lib/python2.7/plat-linux2',
'/home/ptrblck/anaconda2/lib/python2.7/lib-tk',
'/home/ptrblck/anaconda2/lib/python2.7/lib-old',
'/home/ptrblck/anaconda2/lib/python2.7/lib-dynload',
'/home/ptrblck/.local/lib/python2.7/site-packages',
'/home/ptrblck/anaconda2/lib/python2.7/site-packages',
'/home/ptrblck/anaconda2/lib/python2.7/site-packages/Sphinx-1.5.1-py2.7.egg',
'/home/ptrblck/anaconda2/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg',
'/home/ptrblck/anaconda2/lib/python2.7/site-packages/IPython/extensions',
'/home/ptrblck/.ipython']
I tested it on another machine and it works, as in @richard's case.
Thanks! Mine is:
['', '/cluster/tufts/software/anaconda3/lib/python36.zip', '/cluster/tufts/software/anaconda3/lib/python3.6', '/cluster/tufts/software/anaconda3/lib/python3.6/lib-dynload', '/cluster/home/smanic02/.local/lib/python3.6/site-packages', '/cluster/tufts/software/anaconda3/lib/python3.6/site-packages', '/cluster/tufts/software/anaconda3/lib/python3.6/site-packages/Mako-1.0.7-py3.6.egg']
I wonder whether anaconda3 and Python 3.6 on my end could make a difference.
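In the meantime, this is the kind of sanity check I plan to run to confirm which Python, torch, and numpy actually get picked up (just a sketch):

import sys
import numpy
import torch

print(sys.executable)             # which interpreter is running
print(torch.__version__)          # installed PyTorch version
print(torch.__file__)             # where torch is imported from
print(numpy.__file__)             # where numpy is imported from (to rule out a stray install)
print(torch.cuda.is_available())  # whether the GPU build actually sees a device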
I tested it in my Python3 environment with PyTorch 0.4.0 and it also works. Here is the sys.path of this env, in case you would like to see it:
['',
'/home/ptrblck/anaconda2/envs/py3/lib/python36.zip',
'/home/ptrblck/anaconda2/envs/py3/lib/python3.6',
'/home/ptrblck/anaconda2/envs/py3/lib/python3.6/lib-dynload',
'/home/ptrblck/anaconda2/envs/py3/lib/python3.6/site-packages',
'/home/ptrblck/anaconda2/envs/py3/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg']
Thanks! I tried it with anaconda/2 (Python 2.7), which our cluster fortunately has, but it still runs into a deadlock.
SimonW (Simon Wang), May 30, 2018, 7:51pm
It might be environment dependent. It might be best to contact the cluster administrator.
Thanks, @SimonW. I have contacted them, but the cause of the problem is not apparent to them either. The puzzle is why N = 2, 3, ... works but N = 1 does not. Could it be that for N > 1 some kind of multiprocessing kicks in, even though I didn't ask for it?
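One diagnostic I can try is pinning everything to a single thread before running the MWE; a rough sketch (the environment variables are the usual OpenMP/MKL knobs, and I haven't verified they are related to the hang):

import os

# Limit the OpenMP/MKL thread pools before torch is imported.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import torch

torch.set_num_threads(1)        # limit intra-op CPU threads
print(torch.get_num_threads())  # confirm the setting took effect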