[solved][ArchLinux] Using `Variable.backward()` appears to hang the program indefinitely

@albanD unless you suspect that there’s something weird going on with threading, there doesn’t appear to be a whole lot to see:

$ gdb python
GNU gdb (GDB) 7.12.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run net_test.py
Starting program: /home/clemente/anaconda3/bin/python net_test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffb7b78700 (LWP 18149)]
[New Thread 0x7fffb7377700 (LWP 18150)]
[New Thread 0x7fffb6b76700 (LWP 18151)]
done
^C
Thread 1 "python" received signal SIGINT, Interrupt.
0x00007ffff76c4299 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0

but I’d be interested to know why this is happening to me and not anyone else…
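
For reference, a minimal script along these lines is enough to exercise the same path (this is only a sketch, not the actual net_test.py, which I haven’t pasted here):

```python
# Hypothetical minimal repro sketch (not the real net_test.py from this thread).
# Pre-0.4 style autograd API: wrap a tensor in a Variable and call backward().
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3, 4), requires_grad=True)
loss = (x * 2).sum()
loss.backward()   # gradients come out fine
print("done")     # this prints, then the interpreter never exits
```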

Can you type `bt` in gdb after you interrupt the process?

Sorry, here it is:

$ gdb python
Reading symbols from python...done.
(gdb) run net_test.py
Starting program: /home/clemente/anaconda3/bin/python net_test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffb7b78700 (LWP 18437)]
[New Thread 0x7fffb7377700 (LWP 18438)]
[New Thread 0x7fffaffff700 (LWP 18439)]
done
^C
Thread 1 "python" received signal SIGINT, Interrupt.
0x00007ffff76c4299 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007ffff76c4299 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00007fffedeaa75e in torch::autograd::ReadyQueue::~ReadyQueue (this=0x112bf20, __in_chrg=<optimized out>)
   from /home/clemente/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#2  std::default_delete<torch::autograd::ReadyQueue>::operator() (this=<optimized out>, __ptr=0x112bf20) at torch/csrc/autograd/engine.cpp:67
#3  std::unique_ptr<torch::autograd::ReadyQueue, std::default_delete<torch::autograd::ReadyQueue> >::~unique_ptr (this=0x112bec0, __in_chrg=<optimized out>)
    at torch/csrc/autograd/engine.cpp:184
#4  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue> > (__pointer=0x112bec0) at torch/csrc/autograd/engine.cpp:93
#5  std::_Destroy_aux<false>::__destroy<std::unique_ptr<torch::autograd::ReadyQueue>*> (__last=0x112bed0, __first=0x112bec0) at torch/csrc/autograd/engine.cpp:103
#6  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue>*> (__last=0x112bed0, __first=<optimized out>) at torch/csrc/autograd/engine.cpp:126
#7  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue>*, std::unique_ptr<torch::autograd::ReadyQueue> > (__last=0x112bed0, __first=<optimized out>)
    at torch/csrc/autograd/engine.cpp:151
#8  std::vector<std::unique_ptr<torch::autograd::ReadyQueue, std::default_delete<torch::autograd::ReadyQueue> >, std::allocator<std::unique_ptr<torch::autograd::ReadyQueue,std::default_delete<torch::autograd::ReadyQueue> > > >::~vector (this=0x7fffee727ce8 <engine+8>, __in_chrg=<optimized out>) at torch/csrc/autograd/engine.cpp:415
#9  torch::autograd::Engine::~Engine (this=0x7fffee727ce0 <engine>, __in_chrg=<optimized out>) at torch/csrc/autograd/engine.cpp:21
#10 0x00007ffff6a276c0 in __run_exit_handlers () from /usr/lib/libc.so.6
#11 0x00007ffff6a2771a in exit () from /usr/lib/libc.so.6
#12 0x00007ffff7a4ba19 in Py_Exit (sts=0) at Python/pylifecycle.c:1541
#13 0x00007ffff7a4ee82 in handle_system_exit () at Python/pythonrun.c:602
#14 0x00007ffff7a4f12d in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:612
#15 0x00007ffff7a4fa1d in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffdf70) at Python/pythonrun.c:401
#16 0x00007ffff7a6aa41 in run_file (p_cf=0x7fffffffdf70, filename=0x604110 L"net_test.py", fp=0x66ef70) at Modules/main.c:320
#17 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:781
#18 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

It looks like a deadlock when destroying the autograd Engine :confused:
I am not sure what is causing this though… @apaszke will have to step in here.
It’s weird indeed that it happens only to you.
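Since the hang is inside __run_exit_handlers, one stopgap while debugging could be to skip the C-level exit handlers entirely at the very end of your script (just a guess on my side, untested):

```python
# Untested workaround idea: os._exit() terminates the process immediately,
# skipping atexit handlers and C++ static destructors (where the hang occurs).
# Note that it also skips normal Python cleanup (buffers are not flushed, etc.).
import os

# ... rest of the script ...
os._exit(0)
```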

Thanks for the help so far @albanD!

Did you install from source or are you using the binaries? What system are you on?

It seems that it’s a bug in glibc@2.3.2.

EDIT: I think that answer meant that it’s a problem in that guy’s code, but I really have no idea beyond it being some libstdc++ or pthread problem :confused:

I installed from the binaries using conda.

EDIT: I used `conda install pytorch torchvision cuda80 -c soumith` after installing conda via the 64-bit installer from the Anaconda webpage (https://www.continuum.io/downloads)

I’m on Arch Linux:

$ python --version
Python 3.6.0 :: Anaconda 4.3.1 (64-bit)

$ pacman -Q glibc
glibc 2.25-1

I can try installing from source and see if that changes anything.

Installing from source (I’m using what’s on PyTorch master right now) didn’t change the behaviour, but there is some interesting new info in the stack trace from the debugger:

$ gdb python
Reading symbols from python...done.
(gdb) run net_test.py
Starting program: /home/clemente/anaconda3/bin/python net_test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
warning: File "/home/clemente/anaconda3/lib/libstdc++.so.6.0.19-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /home/clemente/anaconda3/lib/libstdc++.so.6.0.19-gdb.py
line to your configuration file "/home/clemente/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/clemente/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[New Thread 0x7fffe7d8f700 (LWP 22877)]
done
^C
Thread 1 "python" received signal SIGINT, Interrupt.
0x00007ffff76c4299 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007ffff76c4299 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00007fffee44450e in torch::autograd::ReadyQueue::~ReadyQueue (this=0xb8ac10, __in_chrg=<optimized out>)
    at torch/csrc/autograd/engine.cpp:36
#2  std::default_delete<torch::autograd::ReadyQueue>::operator() (this=<optimized out>, __ptr=0xb8ac10)
    at /home/clemente/anaconda3/gcc/include/c++/bits/unique_ptr.h:67
#3  std::unique_ptr<torch::autograd::ReadyQueue, std::default_delete<torch::autograd::ReadyQueue> >::~unique_ptr (
    this=0xb8abf0, __in_chrg=<optimized out>) at /home/clemente/anaconda3/gcc/include/c++/bits/unique_ptr.h:184
#4  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue> > (__pointer=0xb8abf0)
    at /home/clemente/anaconda3/gcc/include/c++/bits/stl_construct.h:93
#5  std::_Destroy_aux<false>::__destroy<std::unique_ptr<torch::autograd::ReadyQueue>*> (__last=0xb8abf8, __first=0xb8abf0)
    at /home/clemente/anaconda3/gcc/include/c++/bits/stl_construct.h:103
#6  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue>*> (__last=0xb8abf8, __first=<optimized out>)
    at /home/clemente/anaconda3/gcc/include/c++/bits/stl_construct.h:126
#7  std::_Destroy<std::unique_ptr<torch::autograd::ReadyQueue>*, std::unique_ptr<torch::autograd::ReadyQueue> > (
    __last=0xb8abf8, __first=<optimized out>) at /home/clemente/anaconda3/gcc/include/c++/bits/stl_construct.h:151
#8  std::vector<std::unique_ptr<torch::autograd::ReadyQueue, std::default_delete<torch::autograd::ReadyQueue> >, std::allocator<std::unique_ptr<torch::autograd::ReadyQueue, std::default_delete<torch::autograd::ReadyQueue> > > >::~vector (
    this=0x7fffee783248 <engine+8>, __in_chrg=<optimized out>)
    at /home/clemente/anaconda3/gcc/include/c++/bits/stl_vector.h:415
#9  torch::autograd::Engine::~Engine (this=0x7fffee783240 <engine>, __in_chrg=<optimized out>)
    at /home/clemente/src/pytorch/torch/csrc/autograd/engine.h:21
#10 0x00007ffff6a276c0 in __run_exit_handlers () from /usr/lib/libc.so.6
#11 0x00007ffff6a2771a in exit () from /usr/lib/libc.so.6
#12 0x00007ffff7a4ba19 in Py_Exit (sts=0) at Python/pylifecycle.c:1541
#13 0x00007ffff7a4ee82 in handle_system_exit () at Python/pythonrun.c:602
#14 0x00007ffff7a4f12d in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:612
#15 0x00007ffff7a4fa1d in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>,
    flags=0x7fffffffe320) at Python/pythonrun.c:401
#16 0x00007ffff7a6aa41 in run_file (p_cf=0x7fffffffe320, filename=0x604110 L"net_test.py", fp=0x6615d0) at Modules/main.c:320
#17 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:781
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) q
A debugging session is active.

        Inferior 1 [process 22871] will be killed.

Quit anyway? (y or n) y

It seems like gdb is complaining about not being able to access the version of libstdc++ installed by conda. Is it possible that PyTorch is also using my host version of libstdc++?

I was able to get this program to run successfully in a virtualized Ubuntu environment…

It could have something to do with the libraries that PyTorch depends on.

Could it have something to do with the fact that the versions of glibc and libpthread shown in the stack traces are not the ones provided by Anaconda? Do you know how to encourage the Python runtime to use the Anaconda-provided versions of those libraries?
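
One way to check which copies actually get mapped into the process is a quick Linux-only diagnostic reading /proc/self/maps (just a sketch; it would at least show whether the Anaconda or the system libraries are being used):

```python
# Diagnostic sketch: list the libstdc++ / libpthread / libc files mapped
# into the current process, to see whether they come from anaconda3 or /usr/lib.
import torch  # imported first so its shared libraries get loaded

paths = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        path = line.split()[-1]
        if any(name in path for name in ("libstdc++", "libpthread", "libc-", "libc.so")):
            paths.add(path)

for path in sorted(paths):
    print(path)
```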

@apaszke

The biggest difference I can identify between the Ubuntu system where the code works fine and the Arch system where it hangs is that the Arch machine is running glibc 2.25 while the Ubuntu machine is running glibc 2.23, each with its respective version of libpthread.

I realize now that conda doesn’t really package its own version of glibc, and I’m not sure how one would test building PyTorch against different versions of glibc on the same system.

If this is the case (which it’s too soon to say), anyone running a version of glibc >= 2.25 is going to have problems running PyTorch.
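
To double-check which glibc each environment resolves at runtime, something like this should work (a small ctypes sketch; gnu_get_libc_version() is a glibc-specific call):

```python
# Print the glibc version the interpreter is actually linked against at runtime.
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print("glibc:", libc.gnu_get_libc_version().decode())
```

If the system libc is the one being picked up, this should report 2.25 on the Arch box and 2.23 in the Ubuntu VM.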

For what it is worth, I also have an Arch install and I can reproduce the problem.
The last time I used my Arch machine for PyTorch was two weeks ago and everything was fine (notably, I could run the autograd tests). If I try to run them now, I see the same behaviour described in this issue (everything runs but the program does not exit at the end).

Digging through my install logs, I found this:
[2017-04-01 17:21] [ALPM] upgraded glibc (2.24-2 -> 2.25-1)

So it really seems like glibc is the problem here.

Did any of you try rebuilding PyTorch since the update? Maybe the C++ interface/headers were updated too?

My build of PyTorch is probably more recent than the update, but I can give it a try when I get to my laptop tonight.

Alban found this tweet from somebody seeing something that might be similar: https://twitter.com/pchapuis/status/842738509005934594

I am also experiencing this issue on Fedora 26 with glibc 2.25. I just built PyTorch this morning and haven’t upgraded any packages on my system (including glibc) since building it. I created this GitHub issue: https://github.com/pytorch/pytorch/issues/1233

I’m having the same issue on Manjaro. PyTorch was working fine last week.

Are you using binary installs? This has been fixed in master.

I can confirm that the problem no longer occurs for me now that I’m using master. I would tag this topic as solved, but I don’t know whether @c4n’s issue is related.

This problem is not an issue for me anymore. Thanks!