This issue seems closely related to https://discuss.pytorch.org/t/pytorch-cpu-hangs-on-nn-linear/17748/6.
The following code hangs at the last line:
import torch
from torch.autograd import Variable
N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
If I change N, D_in, and H to smaller values (e.g., below 50), the code runs fine.
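For comparison, here is a minimal sketch of the variant that completes for me (the exact sizes 20, 40, 20 are just an illustration; any values below 50 behave this way on my machine):
import torch
from torch.autograd import Variable

# Identical script, but with all dimensions below 50; this returns immediately.
N, D_in, H = 20, 40, 20
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
print(y.shape)  # torch.Size([20, 20])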
Here is the gdb session and the backtrace (taken after stopping the hung process with Ctrl+Z):
import torch
Missing separate debuginfo for /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Detaching after fork from child process 34213.
from torch.autograd import Variable
N, D_in, H = 50, 100, 50
x = Variable(torch.randn(N, D_in), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
y = x.mm(w1)
[New Thread 0x7fffba990780 (LWP 35188)]
[New Thread 0x7fffba58f800 (LWP 35189)]
[New Thread 0x7fffba18e880 (LWP 35190)]
[New Thread 0x7fffb9d8d900 (LWP 35193)]
[New Thread 0x7fffb998c980 (LWP 35197)]
[New Thread 0x7fffb958ba00 (LWP 35198)]
[New Thread 0x7fffb918aa80 (LWP 35199)]
[New Thread 0x7fffb8d89b00 (LWP 35200)]
[New Thread 0x7fffb8988b80 (LWP 35201)]
[New Thread 0x7fffb8587c00 (LWP 35202)]
[New Thread 0x7fffb8186c80 (LWP 35203)]
[New Thread 0x7fffb7d85d00 (LWP 35207)]
[New Thread 0x7fffb7984d80 (LWP 35208)]
[New Thread 0x7fffb7583e00 (LWP 35209)]
[New Thread 0x7fffb7182e80 (LWP 35210)]
[New Thread 0x7fffb6d81f00 (LWP 35211)]
[New Thread 0x7fffb6980f80 (LWP 35212)]
[New Thread 0x7fffb6180000 (LWP 35213)]
[New Thread 0x7fffb5d7f080 (LWP 35214)]
^Z
Program received signal SIGTSTP, Stopped (user).
pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162 62: movl (%rsp), %edi
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.x86_64
(gdb) backtrace
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007ffff1778ce9 in __kmp_suspend_64 ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#2 0x00007ffff1714fc2 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_gather(barrier_type, kmp_info*, int, int, void (*)(void*, void*), void*) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#3 0x00007ffff1718647 in __kmp_join_barrier(int) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#4 0x00007ffff1747e72 in __kmp_internal_join ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#5 0x00007ffff1747666 in __kmp_join_call ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/../../../libiomp5.so
#6 0x00007fffc59ffd81 in mkl_blas_sgemm_omp_driver_v1 ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libmkl_gnu_thread.so
#7 0x00007fffc59d8124 in mkl_blas_sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/../../../../libmkl_gnu_thread.so
#8 0x00007fffbe82d3d3 in sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/core/../../../../libmkl_intel_lp64.so
#9 0x00007ffff5059890 in sgemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/numpy/core/../../../../libmkl_rt.so
#10 0x00007fffc95e03f1 in THFloatBlas_gemm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#11 0x00007fffc9292181 in THFloatTensor_addmm ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#12 0x00007fffc8f28442 in at::CPUFloatType::_mm(at::Tensor const&, at::Tensor const&) const ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#13 0x00007fffe88a0593 in torch::autograd::VariableType::_mm(at::Tensor const&, at::Tensor const&) const ()
at torch/csrc/autograd/generated/VariableType.cpp:7917
#14 0x00007fffc8e4d769 in at::native::mm(at::Tensor const&, at::Tensor const&) ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#15 0x00007fffc913593f in at::Type::mm(at::Tensor const&, at::Tensor const&) const ()
from /cluster/tufts/software/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so
#16 0x00007fffe87f7c36 in torch::autograd::VariableType::mm(at::Tensor const&, at::Tensor const&) const ()
at torch/csrc/autograd/generated/VariableType.cpp:20127
#17 0x00007fffe8a546e8 in torch::autograd::THPVariable_mm ()
at /opt/conda/conda-bld/pytorch_1524584710464/work/torch/lib/tmp_install/include/ATen/TensorMethods.h:1066
#18 0x00007ffff7bb8fd4 in _PyCFunction_FastCallDict ()
#19 0x00007ffff7c46bec in call_function ()
#20 0x00007ffff7c6b19a in _PyEval_EvalFrameDefault ()
#21 0x00007ffff7c41529 in PyEval_EvalCodeEx ()
#22 0x00007ffff7c422cc in PyEval_EvalCode ()
#23 0x00007ffff7cbeaf4 in run_mod ()
#24 0x00007ffff7b85930 in PyRun_InteractiveOneObjectEx ()
#25 0x00007ffff7b85ae6 in PyRun_InteractiveLoopFlags ()
#26 0x00007ffff7b85b86 in PyRun_AnyFileExFlags.cold.2769 ()
#27 0x00007ffff7b87b69 in Py_Main.cold.2794 ()
#28 0x00007ffff7b8a71e in main ()
I'm using PyTorch 0.4.0 on a RHEL cluster.
P.S.: torch.cuda.is_available() returns False, so this all runs on the CPU.
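For reference, this is how I checked the version and CUDA availability:
import torch

print(torch.__version__)          # 0.4.0
print(torch.cuda.is_available())  # False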
Is numpy causing the deadlock? @richard, do you have any idea what's going on here?
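In case it helps with the diagnosis: here is a quick sketch of mine (Linux-only, reads /proc/self/maps) for listing which OpenMP/MKL runtimes end up loaded. The backtrace above already shows numpy's libiomp5 and torch's libmkl_gnu_thread in the same process, and mixing OpenMP runtimes in one process is a known cause of hangs:
import torch  # import exactly as in the failing script

# List the OpenMP/MKL shared objects mapped into this process (Linux-only).
# Having both an Intel OpenMP runtime (libiomp5) and a GNU one (libgomp)
# loaded at once is a known recipe for deadlocks in threaded BLAS calls.
with open('/proc/self/maps') as f:
    libs = {line.split()[-1] for line in f if '.so' in line}
for lib in sorted(libs):
    if any(key in lib for key in ('iomp', 'gomp', 'mkl')):
        print(lib)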
Does anyone have any ideas on this, please? @colesbury @smth @tom @ptrblck Thanks!
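For anyone hitting the same thing, here is a sketch of a workaround commonly suggested for similar OpenMP-related hangs (I have not verified that it fixes this particular case): force single-threaded BLAS before anything else is imported.
import os

# Must be set before torch/numpy are imported, so the BLAS libraries
# pick the values up at load time.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import torch
torch.set_num_threads(1)  # also cap torch's own intra-op thread pool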