I’m running into an interesting issue with the backward function on a Variable from a very simple network in PyTorch.
When I run the following simple program, execution appears to continue past the Variable.backward() call, but the process never actually exits on its own; I have to kill it manually (in this case with Ctrl+C, which sends SIGINT). This might be intended behaviour, but if so I’m not sure what I should be doing to make the process shut down cleanly.
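My first guess is that something set up by the backward() call leaves a non-daemon thread running, so the interpreter waits on it at shutdown. That is only a guess; a small check I could drop in just before the script returns (standard library only, nothing PyTorch-specific) would be:

import threading

# List the live threads right before exiting; any non-daemon thread
# still alive at this point would keep the interpreter from shutting down.
for t in threading.enumerate():
    print(t.name, 'daemon' if t.daemon else 'non-daemon', t.is_alive())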
$ cat net_test.py
import torch
import sys
import torch.utils
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(100, 75)
        self.fc2 = nn.Linear(75, 25)
        self.fc3 = nn.Linear(25, 1)

    def forward(self, x):
        x = F.elu(self.fc1(x))
        x = F.elu(self.fc2(x))
        x = self.fc3(x)
        return x


if __name__ == '__main__':
    net = Net()
    inp = Variable(torch.randn(1, 100))
    out = net(inp)
    net.zero_grad()
    out.backward(torch.randn(1, 1))
    print('done')
    sys.exit()
$ python
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.__version__
'0.1.11+8aa1cef'
>>>
$ time python net_test.py
done
^C
real 0m17,992s
user 0m0,223s
sys 0m0,037s
Strangely, similar code runs perfectly fine in an IPython notebook, and I can even get a script that trains the network to work with this Variable.backward() call in place, but that program shows the same behaviour: it never closes on its own. (The only workaround I can think of is forcing the exit; see the sketch at the end of this post.)
$ cat net.py
import torch
import numpy as np
import random
import sys
import torch.utils
import torch.optim as optim
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(100, 75)
        self.fc2 = nn.Linear(75, 25)
        self.fc3 = nn.Linear(25, 1)

    def forward(self, x):
        x = F.elu(self.fc1(x))
        x = F.elu(self.fc2(x))
        x = self.fc3(x)
        return x
if __name__ == '__main__':
    net = Net()
    inp = Variable(torch.randn(1, 100))
    out = net(inp)
    net.zero_grad()
    out.backward(torch.randn(1, 1))

    alpha = 0.01
    optimizer = optim.SGD(net.parameters(), lr=alpha)
    optimizer.zero_grad()

    # Toy dataset: random vectors labelled +1 or -1.
    good_vecs = [np.random.randn(100).astype('float32') for _ in range(0, 20)]
    bad_vecs = [np.random.randn(100).astype('float32') for _ in range(0, 20)]
    good_set = [(vec, [1.0]) for vec in good_vecs]
    bad_set = [(vec, [-1.0]) for vec in bad_vecs]
    shuffled_data = bad_set + good_set
    random.shuffle(shuffled_data)

    vectors = []
    values = []
    for vector, value in shuffled_data:
        vectors.append(torch.from_numpy(vector))
        values.append(torch.Tensor(value))
    vectors = torch.stack(vectors)
    values = torch.stack(values)

    loss = nn.MSELoss()
    for epoch in range(3):
        running_loss = 0.0
        for i in range(0, len(shuffled_data), 4):
            inp = vectors[i:i+4]
            label = values[i:i+4]
            inp, label = Variable(inp), Variable(label)
            optimizer.zero_grad()
            outputs = net(inp)
            this_loss = loss(outputs, label)
            this_loss.backward()
            optimizer.step()
            running_loss += this_loss.data
        print(running_loss)
    print('done')
    sys.exit()
$ time python net.py
9.9317
[torch.FloatTensor of size 1]
8.5297
[torch.FloatTensor of size 1]
7.2956
[torch.FloatTensor of size 1]
done
^C
real 19m41,483s
user 0m0,240s
sys 0m0,037s
Sorry if the numpy interchange looks odd; it mirrors how I’m using PyTorch in a project that is otherwise built around numpy.
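For completeness, the only workaround I can think of is to hard-exit with os._exit(), which terminates the process without running the normal interpreter shutdown, so it should presumably sidestep whatever is hanging. I haven't relied on it because it also skips cleanup (buffered output, atexit handlers), and I'd rather understand the cause. A minimal sketch of that idea, applied to the end of the scripts above:

import os
import sys

# ... forward/backward/training code as above ...
print('done')
sys.stdout.flush()  # flush manually, since os._exit() skips buffered-I/O cleanup
os._exit(0)         # hard exit: bypasses the interpreter shutdown that appears to hang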