Crashes when using torch.max() in torch==1.6.0

Stephane_Bersier · August 19, 2020, 2:25pm

I’m using Linux, Ubuntu 18.04 with python3.6.9 and have a problem with the torch.max() function when using torch==1.6.0.

With torch==1.5.0, it works correctly:

steph@steph-desktop:~$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.5.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
torch.return_types.max(
values=tensor([[ 1.4852, 1.0638, -0.7425, -0.4036, 0.8044]]),
indices=tensor([[1, 2, 0, 0, 2]]))
>>> quit()

Then I run pip3 uninstall torch==1.5.0 followed by pip3 install torch==1.6.0 --user,
and repeat the steps above:

steph@steph-desktop:~$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.6.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
Illegal instruction (core dumped)

Then I run pip3 uninstall torch==1.6.0 and pip3 install torch==1.5.0 --user again.

steph@steph-desktop:~$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.5.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
torch.return_types.max(
values=tensor([[ 0.2041, 1.7543, 0.4384, 0.7029, -0.3107]]),
indices=tensor([[1, 1, 1, 0, 0]]))
>>> quit()
steph@steph-desktop:~$

That is, we are back to normal.

Then I went to Google Colab, created a Jupyter notebook, and imported torch. The imported version is torch==1.6.0, and there it works.
So, is the problem on my side, or is it a problem with torch==1.6.0 on Linux?
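
To tell a hard crash such as "Illegal instruction" apart from an ordinary Python exception, the repro can be run in a child process and its exit status inspected. The following is a minimal sketch assuming a POSIX system; the helper name describe_exit is made up for illustration and is not part of torch.

```python
import signal
import subprocess
import sys


def describe_exit(code_snippet):
    """Run code_snippet in a fresh interpreter and describe how it ended.

    On POSIX, subprocess reports death-by-signal as a negative return code,
    so a SIGILL ("Illegal instruction") shows up as -signal.SIGILL.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code_snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if proc.returncode < 0:
        return "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode == 0:
        return "ok"
    return "exited with status {}".format(proc.returncode)


# Usage with the repro from the post (requires torch to be installed):
# print(describe_exit("import torch; print(torch.randn(1, 3, 5).max(1))"))
```

Because the crash happens in the child, the calling script survives and can log the result, for example when comparing several torch versions in separate virtual environments.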

albanD · August 19, 2020, 4:08pm

Hi,

It does work fine on my side as well…

Do you have any specific configurations on your machine for libraries like openMP, blas, libc?
Also what kind of CPU do you have?

Can you try with a different version of python to see if it changes anything as well?

Stephane_Bersier · August 19, 2020, 5:34pm

Moving from python3.6.9 to python3.7.5 doesn’t help.

With torch==1.6.0 and python3.7.5:

steph@steph-desktop:~$ python3.7
Python 3.7.5 (default, Nov 7 2019, 10:50:52)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.6.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
Illegal instruction (core dumped)

While with torch==1.5.0 and python3.7.5:

steph@steph-desktop:~$ python3.7
Python 3.7.5 (default, Nov 7 2019, 10:50:52)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.5.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
torch.return_types.max(
values=tensor([[ 1.0653, 1.4195, 2.4267, -0.1173, 1.0128]]),
indices=tensor([[1, 0, 1, 0, 1]]))
>>> quit()
steph@steph-desktop:~$

LIBRARIES, KERNEL AND PROCESSOR:

Libraries:
(No “fancy” installs for these libraries)
libc6: version 2.27-3ubuntu1.2
libopenmp: version 2.1.1-8
libblas3: version 3.7.1-4ubuntu1

Kernel:
steph@steph-desktop:~$ uname -a
Linux steph-desktop 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

For the CPU:
Indeed, the problem might well be my (somewhat old) processor, which supports SSE2 but not the SSE4 instruction set.
Does the processor have to support SSE4 for the torch.max() function to work properly?

I know that I can’t install tensorflow because of incompatibility of my cpu with SSE4.

steph@steph-desktop:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 16
Model: 10
Model name: AMD Phenom™ II X6 1065T Processor
Stepping: 0
CPU MHz: 1560.909
CPU max MHz: 2900.0000
CPU min MHz: 800.0000
BogoMIPS: 5827.05
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 6144K
NUMA node0 CPU(s): 0-5
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
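
The Flags line above can be checked mechanically for the SIMD extensions of interest. The sketch below parses lscpu-style text and reports a few flags; parse_cpu_flags and the WANTED list are illustrative choices, not anything torch itself uses, and the thread has not established at this point which instruction actually faults. Note that the Linux flag names for SSE4.1 and SSE4.2 are sse4_1 and sse4_2, that pni is SSE3, and that sse4a is a separate AMD-specific extension.

```python
def parse_cpu_flags(lscpu_output):
    """Return the set of flags from the 'Flags:' line of lscpu-style text."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            return set(value.split())
    return set()


# Flags line copied from the lscpu output in the post.
LSCPU_FLAGS_LINE = (
    "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat "
    "pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb "
    "rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid "
    "extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm "
    "extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit "
    "wdt cpb hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter"
)

# Illustrative selection of SIMD-related flags to report on.
WANTED = ["sse2", "pni", "ssse3", "sse4_1", "sse4_2", "avx", "avx2"]

flags = parse_cpu_flags(LSCPU_FLAGS_LINE)
report = {name: name in flags for name in WANTED}
for name in WANTED:
    print("{:8s} {}".format(name, "yes" if report[name] else "no"))
```

On a live machine the same function can be fed the text of `lscpu` directly instead of the pasted line.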

malfet · August 19, 2020, 9:12pm

Hi,
Can you please try running your code with ATEN_CPU_CAPABILITY=default?
Also, if you have gdb installed on your machine, can you please run:

gdb /usr/bin/python -ex "set args -c 'import torch;print(torch.rand(1,2,5).max(1))'" -ex "run" -ex "bt"

And share the output here?

Stephane_Bersier · August 19, 2020, 9:51pm

Hi,
Regarding ATEN_CPU_CAPABILITY:

steph@steph-desktop:~$ export ATEN_CPU_CAPABILITY=default
steph@steph-desktop:~$ echo $ATEN_CPU_CAPABILITY
default
steph@steph-desktop:~$ python3.7
Python 3.7.5 (default, Nov 7 2019, 10:50:52)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.6.0'
>>> a=torch.randn(1,3,5)
>>> a.max(1)
Illegal instruction (core dumped)
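
For a one-off test, the variable can also be scoped to a single child process instead of being exported in the shell. This is a sketch only: run_with_capability is a hypothetical helper name, and placing the variable in the child's environment before the interpreter starts sidesteps any question about when the library reads it. As the transcript above shows, the setting did not prevent the crash in this case.

```python
import os
import subprocess
import sys


def run_with_capability(code_snippet, capability="default"):
    """Run code_snippet in a child interpreter with ATEN_CPU_CAPABILITY set."""
    env = dict(os.environ)                     # copy, so the parent is untouched
    env["ATEN_CPU_CAPABILITY"] = capability
    return subprocess.run(
        [sys.executable, "-c", code_snippet],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


# Usage with the repro from this thread (requires torch to be installed):
# result = run_with_capability("import torch; print(torch.randn(1, 3, 5).max(1))")
# print(result.returncode, result.stdout)
```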

Regarding gdb output:
For python3.7.5 and torch==1.6.0 the gdb output is:

steph@steph-desktop:~$ gdb python3.7 -ex "set args -c 'import torch;print(torch.rand(1,2,5).max(1))'" -ex "run" -ex "bt"
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later …
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
(link).
Find the GDB manual and other documentation resources online at: (link)
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Registered pretty printers for UE4 classes
Reading symbols from python3.7...(no debugging symbols found)...done.
Starting program: /usr/bin/python3.7 -c 'import torch;print(torch.rand(1,2,5).max(1))'
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa5778700 (LWP 3552)]
[New Thread 0x7fffa4f77700 (LWP 3553)]
[New Thread 0x7fffa2776700 (LWP 3554)]
[New Thread 0x7fff9df75700 (LWP 3555)]
[New Thread 0x7fff9b774700 (LWP 3556)]

Thread 1 "python3.7" received signal SIGILL, Illegal instruction.
0x00007fffe4c04409 in at::TensorIteratorConfig::declare_static_shape(c10::ArrayRef, long) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#0 0x00007fffe4c04409 in at::TensorIteratorConfig::declare_static_shape(c10::ArrayRef, long) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#1 0x00007fffe54e66f7 in at::native::(anonymous namespace)::max_kernel_impl ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2 0x00007fffe449d83e in void at::native::DispatchStub<void (*)(at::Tensor&, at::Tensor&, at::Tensor const&, long, bool), at::native::max_stub>::operator()<at::Tensor&, at::Tensor&, at::Tensor const&, long&, bool&>(c10::DeviceType, at::Tensor&, at::Tensor&, at::Tensor const&, long&, bool&) () from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007fffe4499bd2 in at::native::max_out(at::Tensor&, at::Tensor&, at::Tensor const&, long, bool) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#4 0x00007fffe449a77f in at::native::max(at::Tensor const&, long, bool) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#5 0x00007fffe48d3047 in at::TypeDefault::max_dim(at::Tensor const&, long, bool) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#6 0x00007fffe4751544 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, long, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, long, bool> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&, long, bool)>::call(c10::OperatorKernel*, at::Tensor const&, long, bool) () from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#7 0x00007fffe47ebf54 in at::max(at::Tensor const&, long, bool) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#8 0x00007fffe648b9e2 in torch::autograd::VariableType::max_dim(at::Tensor const&, long, bool) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#9 0x00007fffe4751544 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(at::Tensor const&, long, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, long, bool> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&, long, bool)>::call(c10::OperatorKernel*, at::Tensor const&, long, bool) () from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffe4974944 in at::Tensor::max(long, bool) const ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007ffff3de9d33 in torch::autograd::THPVariable_max(_object*, _object*, _object*) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#12 0x00000000004d6194 in _PyMethodDescr_FastCallKeywords ()
#13 0x0000000000551d89 in _PyEval_EvalFrameDefault ()
#14 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#15 0x0000000000530aef in PyRun_StringFlags ()
#16 0x000000000063138d in PyRun_SimpleStringFlags ()
#17 0x000000000065473d in ?? ()
#18 0x000000000065486e in _Py_UnixMain ()
#19 0x00007ffff7a05b97 in __libc_start_main (main=0x4b84d0 , argc=3, argv=0x7fffffffdba8, init=,
fini=, rtld_fini=, stack_end=0x7fffffffdb98) at …/csu/libc-start.c:310
#20 0x00000000005df80a in _start ()
(gdb) quit
A debugging session is active.

Inferior 1 [process 3542] will be killed.

Quit anyway? (y or n) y
steph@steph-desktop:~$
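
A backtrace like the one above is easier to scan once it is reduced to frame number, symbol, and library. The parser below is a rough sketch that assumes the layout seen in this paste, where the "from <library>" part may sit on the frame line or wrap onto the next line; parse_backtrace is an illustrative name, not a gdb or torch facility.

```python
import re

# Matches lines such as "#3 0x00007fffe4499bd2 in at::native::max_out(...) ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (.*)$")


def parse_backtrace(text):
    """Parse gdb 'bt' output into a list of frame dictionaries."""
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        match = FRAME_RE.match(line)
        if match:
            symbol, _, library = match.group(3).partition(" from ")
            frames.append({
                "index": int(match.group(1)),
                "address": match.group(2),
                "symbol": symbol.strip(),
                "library": library.strip() or None,
            })
        elif line.startswith("from ") and frames and frames[-1]["library"] is None:
            # Continuation line carrying the shared library of the previous frame.
            frames[-1]["library"] = line[len("from "):]
    return frames


# Three frames copied from the backtrace in the post.
SAMPLE = """\
#0 0x00007fffe4c04409 in at::TensorIteratorConfig::declare_static_shape(c10::ArrayRef, long) ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#1 0x00007fffe54e66f7 in at::native::(anonymous namespace)::max_kernel_impl ()
from /home/steph/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#12 0x00000000004d6194 in _PyMethodDescr_FastCallKeywords ()
"""

frames = parse_backtrace(SAMPLE)
for frame in frames:
    print(frame["index"], frame["symbol"], frame["library"])
```

In this trace the faulting frame (#0) and its caller both live in libtorch_cpu.so, which points at the torch binary rather than the Python interpreter.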

While with python3.7.5 and torch==1.5.0 installed, the gdb output is:

steph@steph-desktop:~$ gdb python3.7 -ex "set args -c 'import torch;print(torch.rand(1,2,5).max(1))'" -ex "run" -ex "bt"
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later (link)
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
(link).
Find the GDB manual and other documentation resources online at: (link)
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Registered pretty printers for UE4 classes
Reading symbols from python3.7...(no debugging symbols found)...done.
Starting program: /usr/bin/python3.7 -c 'import torch;print(torch.rand(1,2,5).max(1))'
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa59f8700 (LWP 3197)]
[New Thread 0x7fffa51f7700 (LWP 3198)]
[New Thread 0x7fffa09f6700 (LWP 3199)]
[New Thread 0x7fff9e1f5700 (LWP 3200)]
[New Thread 0x7fff9b9f4700 (LWP 3201)]
[New Thread 0x7fff92dcd700 (LWP 3203)]
[New Thread 0x7fff925cc700 (LWP 3204)]
[New Thread 0x7fff91dcb700 (LWP 3205)]
[New Thread 0x7fff915ca700 (LWP 3206)]
[New Thread 0x7fff90dc9700 (LWP 3207)]
torch.return_types.max(
values=tensor([[0.3803, 0.9634, 0.5299, 0.9652, 0.6461]]),
indices=tensor([[1, 1, 0, 0, 0]]))
[Thread 0x7fff9b9f4700 (LWP 3201) exited]
[Thread 0x7fff90dc9700 (LWP 3207) exited]
[Thread 0x7fff915ca700 (LWP 3206) exited]
[Thread 0x7fff91dcb700 (LWP 3205) exited]
[Thread 0x7fff925cc700 (LWP 3204) exited]
[Thread 0x7fff92dcd700 (LWP 3203) exited]
[Thread 0x7fff9e1f5700 (LWP 3200) exited]
[Thread 0x7fffa09f6700 (LWP 3199) exited]
[Thread 0x7fffa51f7700 (LWP 3198) exited]
[Thread 0x7fffa59f8700 (LWP 3197) exited]
[Inferior 1 (process 3187) exited normally]
No stack.
(gdb) quit

malfet · August 19, 2020, 10:12pm

Thank you for the quick reply. Can you please also run the disassemble command when the exception is hit in gdb, by executing:
gdb python3.7 -ex "set args -c 'import torch;print(torch.rand(1,2,5).max(1))'" -ex "run" -ex "bt" -ex "disassemble"
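
Nested quotes in a command like this are easily mangled by copy and paste. As an alternative spelling (not the exact command given above), the same steps can be assembled as an argument list using gdb's standard -batch and --args options and then quoted with shlex; -batch also turns off gdb's pager, so the backtrace is not interrupted by a prompt. The function name gdb_command is hypothetical.

```python
import shlex


def gdb_command(python_exe, snippet):
    """Build a gdb argv that runs snippet, then prints a backtrace and disassembly."""
    return [
        "gdb", "-batch",
        "-ex", "run",
        "-ex", "bt",
        "-ex", "disassemble",
        "--args", python_exe, "-c", snippet,
    ]


cmd = gdb_command("python3.7", "import torch;print(torch.rand(1,2,5).max(1))")

# A shell line with plain ASCII quoting, safe to paste into a terminal.
line = " ".join(shlex.quote(part) for part in cmd)
print(line)
```

The list form can be passed straight to subprocess.run(cmd), which avoids shell quoting altogether.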

Stephane_Bersier · August 19, 2020, 10:53pm

Sorry, the file was incomplete…
So, here it is again:

malfet · August 19, 2020, 11:23pm

Thank you for the detailed repro instructions, filed https://github.com/pytorch/pytorch/issues/43300

Stephane_Bersier · August 19, 2020, 11:39pm

I’ll keep an eye on this issue on GitHub.
Thank you very much for having spent some time on this.
Best regards,
Stéphane

Andre_Godinho · October 19, 2020, 9:28pm

Did anyone fix this?

I have the same problem: with torch==1.6.0, I get a core dump when using torch.max().

Stephane_Bersier · October 20, 2020, 6:38am

The issue should be fixed in version 1.7.
See https://github.com/pytorch/pytorch/issues/43300
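
A script that has to run on an affected machine could guard against the release reported here. The check below is a sketch based only on what this thread states: 1.5.0 worked, 1.6.0 crashed, and 1.7 is expected to contain the fix. The function names are illustrative, and the assumption that the whole 1.6 series behaves like 1.6.0 is mine.

```python
import re


def release_tuple(version):
    """Parse the leading 'major.minor' of a version string such as '1.7.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    return int(match.group(1)), int(match.group(2))


def in_reported_range(version):
    """True for the 1.6 series, the only release this thread reports as crashing."""
    return release_tuple(version) == (1, 6)


# Usage (requires torch to be installed):
# import torch
# if in_reported_range(torch.__version__):
#     print("torch 1.6.x: max() may crash on older CPUs, see issue 43300")
```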