Searching online suggests this hang is due to CUDA incompatibility issues, but I have no CUDA installed and am using the torch-cpu install. I also tried exporting
(fairseq) bash-4.1$ gdb --args python
build/ fairseq.egg-info/ .interactive.py.swp README.md tests/
CONTRIBUTING.md fairseq.gif LICENSE requirements.txt train.py
data/ generate.py multiprocessing_train.py score.py wmt14.en-de.fconv-py/
distributed_train.py .git/ PATENTS scripts/ wmt14.en-fr.fconv-py/
example.py .gitignore preprocess.py setup.py
fairseq/ interactive.py __pycache__/ singleprocess_train.py
(fairseq) bash-4.1$ gdb --args python example.py
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-90.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/bin/python...done.
(gdb) run
Starting program: /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/bin/python example.py
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Detaching after fork from child process 24591.
[New Thread 0x7fffe535c780 (LWP 29388)]
[New Thread 0x7fffe4f5b800 (LWP 29396)]
[New Thread 0x7fffe4b5a880 (LWP 29400)]
[New Thread 0x7fffe4759900 (LWP 29405)]
[New Thread 0x7fffe4358980 (LWP 29409)]
[New Thread 0x7fffe3f57a00 (LWP 29414)]
[New Thread 0x7fffe3b56a80 (LWP 29418)]
[New Thread 0x7fffe3755b00 (LWP 29423)]
[New Thread 0x7fffe3354b80 (LWP 29427)]
[New Thread 0x7fffe2f53c00 (LWP 29432)]
[New Thread 0x7fffe2b52c80 (LWP 29436)]
Before
^C
Program received signal SIGINT, Interrupt.
pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162 62: movl (%rsp), %edi
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) backtrace
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007ffff2083ce9 in __kmp_suspend_64 ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#2 0x00007ffff201ffc2 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_gather(barrier_type, kmp_info*, int, int, void (*)(void*, void*), void*) ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#3 0x00007ffff2023647 in __kmp_join_barrier(int) ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#4 0x00007ffff2052e72 in __kmp_internal_join ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#5 0x00007ffff2052666 in __kmp_join_call ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#6 0x00007ffff20267f7 in __kmpc_fork_call ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#7 0x00007fffeeeeb157 in mkl_blas_sgemv_omp ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#8 0x00007fffeed67208 in mkl_blas_sgemv ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#9 0x00007fffeeea1d68 in mkl_blas_sgemm ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_thread.so
#10 0x00007ffff0fc32f1 in sgemm_ ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/mkl_fft/../../../libmkl_intel_lp64.so
#11 0x00007ffff5a06890 in sgemm_ ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/numpy/core/../../../../libmkl_rt.so
#12 0x00007fffe90acd11 in THFloatBlas_gemm ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libATen.so
#13 0x00007fffe8d5eaa1 in THFloatTensor_addmm ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libATen.so
#14 0x00007fffe8b5dc29 in at::CPUFloatType::s_addmm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Scalar, at::Scalar) const ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libATen.so
#15 0x00007fffe9e034e2 in torch::autograd::VariableType::s_addmm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Scalar, at::Scalar) const () at torch/csrc/autograd/generated/VariableType.cpp:7500
#16 0x00007fffe8c2f2f8 in at::Type::addmm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Scalar, at::Scalar) const ()
from /gpfs/nlu/data/filesets/projects/here-work/sergey_mkrtchyan/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libATen.so
#17 0x00007fffe9ebc076 in torch::autograd::THPVariable_addmm(_object*, _object*, _object*) ()
at /opt/conda/conda-bld/pytorch-cpu_1524582300956/work/torch/lib/tmp_install/include/ATen/TensorMethods.h:682
#18 0x00007ffff7e4eb94 in _PyCFunction_FastCallDict ()
#19 0x00007ffff7ede67c in call_function ()
#20 0x00007ffff7f00cba in _PyEval_EvalFrameDefault ()
#21 0x00007ffff7ed7a94 in _PyEval_EvalCodeWithName ()
#22 0x00007ffff7ed8941 in fast_function ()
#23 0x00007ffff7ede755 in call_function ()
#24 0x00007ffff7f00cba in _PyEval_EvalFrameDefault ()
#25 0x00007ffff7ed8d7b in _PyFunction_FastCallDict ()
#26 0x00007ffff7e4ef5f in _PyObject_FastCallDict ()
#27 0x00007ffff7e53a03 in _PyObject_Call_Prepend ()
#28 0x00007ffff7e4e99e in PyObject_Call ()
#29 0x00007ffff7f02470 in _PyEval_EvalFrameDefault ()
#30 0x00007ffff7ed7a94 in _PyEval_EvalCodeWithName ()
#31 0x00007ffff7ed8e1b in _PyFunction_FastCallDict ()
#32 0x00007ffff7e4ef5f in _PyObject_FastCallDict ()
#33 0x00007ffff7e53a03 in _PyObject_Call_Prepend ()
#34 0x00007ffff7e4e99e in PyObject_Call ()
#35 0x00007ffff7eab9b7 in slot_tp_call ()
#36 0x00007ffff7e4ed7b in _PyObject_FastCallDict ()
#37 0x00007ffff7ede7ce in call_function ()
#38 0x00007ffff7f00cba in _PyEval_EvalFrameDefault ()
#39 0x00007ffff7ed8d7b in _PyFunction_FastCallDict ()
#40 0x00007ffff7e4ef5f in _PyObject_FastCallDict ()
#41 0x00007ffff7e53a03 in _PyObject_Call_Prepend ()
#42 0x00007ffff7e4e99e in PyObject_Call ()
#43 0x00007ffff7f02470 in _PyEval_EvalFrameDefault ()
#44 0x00007ffff7ed7a94 in _PyEval_EvalCodeWithName ()
#45 0x00007ffff7ed8e1b in _PyFunction_FastCallDict ()
#46 0x00007ffff7e4ef5f in _PyObject_FastCallDict ()
#47 0x00007ffff7e53a03 in _PyObject_Call_Prepend ()
#48 0x00007ffff7e4e99e in PyObject_Call ()
#49 0x00007ffff7eab9b7 in slot_tp_call ()
#50 0x00007ffff7e4ed7b in _PyObject_FastCallDict ()
#51 0x00007ffff7ede7ce in call_function ()
#52 0x00007ffff7f00cba in _PyEval_EvalFrameDefault ()
#53 0x00007ffff7ed9459 in PyEval_EvalCodeEx ()
#54 0x00007ffff7eda1ec in PyEval_EvalCode ()
#55 0x00007ffff7f549a4 in run_mod ()
#56 0x00007ffff7f54da1 in PyRun_FileExFlags ()
#57 0x00007ffff7f54fa4 in PyRun_SimpleFileExFlags ()
#58 0x00007ffff7f58a9e in Py_Main ()
#59 0x00007ffff7e204be in main ()
Are you using multiprocessing? I’ve run into deadlock issues with multiprocessing and OpenMP. Here are a few things to try:
Try adding `import multiprocessing; multiprocessing.set_start_method('spawn')` at the very beginning of your program. The default start method ("fork") can have problems with threads.
Try running with the environment variable `OMP_NUM_THREADS=1`. This may be slower, but I think it should avoid the OpenMP deadlocks.
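Both workarounds can be sketched in one small script. This is a hypothetical illustration, not your actual program: the `worker` function stands in for whatever BLAS-heavy work the child processes do.

```python
import os

# Workaround 2: limit OpenMP to one thread. This must be set before
# numpy/torch (and hence libiomp5) are first imported.
os.environ["OMP_NUM_THREADS"] = "1"

import multiprocessing


def worker(x):
    # Placeholder for the real work (e.g. a matmul that ends up in sgemm).
    return x * x


if __name__ == "__main__":
    # Workaround 1: use "spawn" so children start a fresh interpreter
    # instead of inheriting a forked copy of the OpenMP thread pool,
    # which is what can deadlock after fork.
    multiprocessing.set_start_method("spawn")
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))  # [1, 4, 9]
```

Setting the environment variable on the command line (`OMP_NUM_THREADS=1 python example.py`) is equivalent and avoids the import-order concern entirely.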
PyTorch uses OpenMP by default (not multiprocessing). I see from your backtrace that the conda environment is named "fairseq"; I mentioned multiprocessing because I know the fairseq project uses it.
You are right, I originally ran into this issue while debugging fairseq, but oddly enough I get the same deadlock when running the simple PyTorch example here.