I get an “Illegal instruction (core dumped)” error when trying to copy a tensor to CUDA memory. I tried Python 3.6 and 3.7, and CUDA 9.0 and 9.2. I have no idea how to debug this.
This code works fine with PyTorch 0.4.1 but always fails on 1.0.0.dev:
import torch
torch.tensor([1.,2.]).cuda()
Any idea how I can solve this?
GDB output:
(gdb) run teste.py
Starting program: /home/marco/anaconda3/envs/fastai/bin/python teste.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
1.0.0.dev20181004
9.2.148
[New Thread 0x7fffae733700 (LWP 4189)]
True
GeForce GTX 1070
Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007fffb9057bc3 in at::cuda::detail::initGlobalStreamState() ()
from /home/marco/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
(gdb)
PyTorch version: 1.0.0.dev20181003
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 396.54
cuDNN version: Could not collect
This is weird. Thanks for reporting it. Can you report the output of the following GDB commands after the “Thread 1 “python” received signal SIGILL, Illegal instruction”?
bt (backtrace)
disas (disassemble)
Do you know what CPU you have? On Linux, you can usually find out by cat /proc/cpuinfo
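A SIGILL at startup often means the binary was compiled with SIMD instructions (e.g. AVX2) that the CPU does not support, which is why the CPU model matters here. As a sketch (this helper is not part of the thread, just an illustration of the `/proc/cpuinfo` check), you can list the instruction-set flags the kernel reports:

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of feature flags reported for the first CPU core."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # not on Linux, or /proc unavailable
    return set()

flags = cpu_flags()
# A binary built for an ISA the CPU lacks (e.g. AVX2 code on a
# pre-Haswell CPU) raises SIGILL when that code path is executed.
for isa in ("sse4_2", "avx", "avx2"):
    print(isa, "supported" if isa in flags else "MISSING")
```

If one of these shows as missing, that points to a build-flag mismatch rather than a CUDA problem.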
I still got the “Illegal instruction (core dumped)” error, but it’s different this time:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
THCPModule_initExtension (self=<optimized out>) at torch/csrc/cuda/Module.cpp:354
354 auto _state_cdata = THPObjectPtr(PyLong_FromVoidPtr(state));
Thanks for trying out the PR. I’m not sure exactly what’s going on, but that’s a different error (“Segmentation fault” vs. “Illegal instruction”). If you’re building from source, make sure you run python setup.py clean before you rebuild. Sometimes, only some files get rebuilt which can cause those sorts of crashes.
FWIW, I had the same problem as @elmarculino but I could solve it by installing my own MAGMA library.
Installing PyTorch with DEBUG=1 and running under gdb revealed that there was a problem with a MAGMA-related function (see https://pastebin.com/tpb28w7V ). So I removed the conda package magma-cuda92, installed MAGMA 7.3.0 from source, recompiled PyTorch, and it worked.
@colesbury
I’m getting the same error. Tried to do a clean install and it’s still happening.
Python version: 3.7.0
PyTorch version: ‘1.0.0a0+4b86a21’ (built from source)