Tensor multiplication hangs

The issue below seems closely related to https://discuss.pytorch.org/t/pytorch-cpu-hangs-on-nn-linear/17748/6

The following code hangs at the last line:

import torch
from torch.autograd import Variable
N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)

If I change N, D_in, and H to smaller values (e.g., below 50), it works fine.
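For reference, here is a minimal sketch of the same reproduction with CPU threading forced to a single thread (torch.set_num_threads is the standard knob for this); if the hang sits in the threaded GEMM path, this variant would be expected to behave differently:

import torch
from torch.autograd import Variable

# Force single-threaded CPU ops before the first BLAS call; if the hang is
# in the OpenMP/MKL thread pool, this variant should complete.
torch.set_num_threads(1)

N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
print(y.shape)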

Here is the gdb backtrace:

import torch
Missing separate debuginfo for /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
Detaching after fork from child process 34213.
from torch.autograd import Variable
N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
[New Thread 0x7fffba990780 (LWP 35188)]
[New Thread 0x7fffba58f800 (LWP 35189)]
[New Thread 0x7fffba18e880 (LWP 35190)]
[New Thread 0x7fffb9d8d900 (LWP 35193)]
[New Thread 0x7fffb998c980 (LWP 35197)]
[New Thread 0x7fffb958ba00 (LWP 35198)]
[New Thread 0x7fffb918aa80 (LWP 35199)]
[New Thread 0x7fffb8d89b00 (LWP 35200)]
[New Thread 0x7fffb8988b80 (LWP 35201)]
[New Thread 0x7fffb8587c00 (LWP 35202)]
[New Thread 0x7fffb8186c80 (LWP 35203)]
[New Thread 0x7fffb7d85d00 (LWP 35207)]
[New Thread 0x7fffb7984d80 (LWP 35208)]
[New Thread 0x7fffb7583e00 (LWP 35209)]
[New Thread 0x7fffb7182e80 (LWP 35210)]
[New Thread 0x7fffb6d81f00 (LWP 35211)]
[New Thread 0x7fffb6980f80 (LWP 35212)]
[New Thread 0x7fffb6180000 (LWP 35213)]
[New Thread 0x7fffb5d7f080 (LWP 35214)]
^Z
Program received signal SIGTSTP, Stopped (user).
pthread_cond_wait@@GLIBC_2.3.2 () at …/nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162 62: movl (%rsp), %edi
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.x86_64
(gdb) backtrace
#0 pthread_cond_wait@@GLIBC_2.3.2 () at …/nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007ffff1778ce9 in __kmp_suspend_64 ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
#2 0x00007ffff1714fc2 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_gather(barrier_type, kmp_info*, int, int, void (*)(void*, void*), void*) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
#3 0x00007ffff1718647 in __kmp_join_barrier(int) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
#4 0x00007ffff1747e72 in __kmp_internal_join ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
#5 0x00007ffff1747666 in __kmp_join_call ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/…/…/…/libiomp5.so
#6 0x00007fffc59ffd81 in mkl_blas_sgemm_omp_driver_v1 ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/…/…/…/…/libmkl_gnu_thread.so
#7 0x00007fffc59d8124 in mkl_blas_sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/…/…/…/…/libmkl_gnu_thread.so
#8 0x00007fffbe82d3d3 in sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/core/…/…/…/…/libmkl_intel_lp64.so
#9 0x00007ffff5059890 in sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/core/…/…/…/…/libmkl_rt.so
#10 0x00007fffc95e03f1 in THFloatBlas_gemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#11 0x00007fffc9292181 in THFloatTensor_addmm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#12 0x00007fffc8f28442 in at::CPUFloatType::_mm(at::Tensor const&, at::Tensor const&) const ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#13 0x00007fffe88a0593 in torch::autograd::VariableType::_mm(at::Tensor const&, at::Tensor const&) const ()
at torch/csrc/autograd/generated/VariableType.cpp:7917
#14 0x00007fffc8e4d769 in at::native::mm(at::Tensor const&, at::Tensor const&) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#15 0x00007fffc913593f in at::Type::mm(at::Tensor const&, at::Tensor const&) const ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#16 0x00007fffe87f7c36 in torch::autograd::VariableType::mm(at::Tensor const&, at::Tensor const&) const ()
at torch/csrc/autograd/generated/VariableType.cpp:20127
#17 0x00007fffe8a546e8 in torch::autograd::THPVariable_mm ()
at /opt/conda/conda-bld/pytorch_1524584710464/work/torch/lib/tmp_install/include/ATen/TensorMethods.h:1066
#18 0x00007ffff7bb8fd4 in _PyCFunction_FastCallDict ()
#19 0x00007ffff7c46bec in call_function ()
#20 0x00007ffff7c6b19a in _PyEval_EvalFrameDefault ()
#21 0x00007ffff7c41529 in PyEval_EvalCodeEx ()
#22 0x00007ffff7c422cc in PyEval_EvalCode ()
#23 0x00007ffff7cbeaf4 in run_mod ()
#24 0x00007ffff7b85930 in PyRun_InteractiveOneObjectEx ()
#25 0x00007ffff7b85ae6 in PyRun_InteractiveLoopFlags ()
#26 0x00007ffff7b85b86 in PyRun_AnyFileExFlags.cold.2769 ()
#27 0x00007ffff7b87b69 in Py_Main.cold.2794 ()
#28 0x00007ffff7b8a71e in main ()

I’m using PyTorch 0.4.0 on a RHEL cluster.

P.S.: torch.cuda.is_available() returns False.

Is NumPy causing the deadlock? @richard, do you have any idea what’s going on here?

Does anyone have any ideas on this, please? @colesbury @smth @tom @ptrblck Thanks!

Hm, that computation should only take a fraction of a second, and I am not sure why you see all that verbose output when running y = x.mm(w1). Also, as far as I know, PyTorch doesn’t use NumPy for any of its computations; it has its own tensor library (torch and ATen) for things like matrix multiplication.
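One thing that does stand out in your backtrace is that two different OpenMP setups show up in the same process: libiomp5 (Intel’s OpenMP runtime, loaded via NumPy’s MKL) in frames #1–#5, and libmkl_gnu_thread (MKL’s GNU-OpenMP threading layer, loaded via torch) in frames #6–#7. Mixing OpenMP runtimes in a single process is a known source of exactly this kind of hang. Here is a small Linux-only sketch (it just reads /proc/self/maps, nothing PyTorch-specific) to list which of these libraries actually end up loaded:

import torch   # loads torch's BLAS/OpenMP libraries
import numpy   # loads NumPy's, for comparison

# Linux-only: list the OpenMP/MKL shared objects mapped into this process.
# Seeing more than one OpenMP runtime (e.g. libiomp5 plus a GNU threading
# layer) at the same time can cause this kind of deadlock.
keywords = ("libiomp", "libgomp", "libmkl")
seen = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        parts = line.split()
        if parts and any(k in parts[-1] for k in keywords):
            seen.add(parts[-1])
for path in sorted(seen):
    print(path)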

Thanks. Any advice on how I should proceed, please?

I don’t have 0.4 to test, but the code certainly copy-pastes well into the PyTorch “masterish” build that I have installed (CPU-only, compiled with MKL, on Debian).
I’d indeed probably try to reproduce this with a self-compiled master.
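Until you get to a self-compiled build, a sketch like the following, which pins the thread counts through the standard OMP_NUM_THREADS / MKL_NUM_THREADS environment variables before anything is imported, might at least tell you whether the OpenMP runtimes are what is blocking:

import os
# The OpenMP/MKL runtimes read these when they initialize, so they must be
# set before torch (and anything that pulls in NumPy/MKL) is imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch
from torch.autograd import Variable

N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
print(y.sum())  # if this prints, the multiply no longer hangs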

Best regards

Thomas