I get an “Illegal instruction (core dumped)” error when trying to copy a tensor to CUDA memory. I tried Python 3.6 and 3.7, and CUDA 9.0 and 9.2. I have no idea how to debug this.
This code works fine with PyTorch 0.4.1 but always fails on 1.0.0.dev:
import torch
torch.tensor([1.,2.]).cuda()
Any idea how I can solve this?
GDB output:
(gdb) run teste.py
Starting program: /home/marco/anaconda3/envs/fastai/bin/python teste.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
1.0.0.dev20181004
9.2.148
[New Thread 0x7fffae733700 (LWP 4189)]
True
GeForce GTX 1070
Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007fffb9057bc3 in at::cuda::detail::initGlobalStreamState() ()
from /home/marco/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
(gdb)
PyTorch version: 1.0.0.dev20181003
Is debug build: No
CUDA used to build PyTorch: 9.2.148
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 396.54
cuDNN version: Could not collect
This is weird. Thanks for reporting it. Can you report the output of the following GDB commands after the “Thread 1 “python” received signal SIGILL, Illegal instruction”?
bt (backtrace)
disas (disassemble)
Do you know what CPU you have? On Linux, you can usually find out by cat /proc/cpuinfo
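A SIGILL at startup often means the binary was compiled with SIMD instructions (e.g. AVX2) that the CPU does not support, which is why the CPU model matters here. As a sketch (this helper is not part of the thread, just an illustration of the `/proc/cpuinfo` check), you can list the instruction-set flags the kernel reports:

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of feature flags reported for the first CPU core."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # not on Linux, or /proc unavailable
    return set()

flags = cpu_flags()
# A binary built for an ISA the CPU lacks (e.g. AVX2 code on a
# pre-Haswell CPU) raises SIGILL when that code path is executed.
for isa in ("sse4_2", "avx", "avx2"):
    print(isa, "supported" if isa in flags else "MISSING")
```

If one of these shows as missing, that points to a build-flag mismatch rather than a CUDA problem.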
I still got the “Illegal instruction (core dumped)” error, but it’s different this time:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
THCPModule_initExtension (self=<optimized out>) at torch/csrc/cuda/Module.cpp:354
354 auto _state_cdata = THPObjectPtr(PyLong_FromVoidPtr(state));
Thanks for trying out the PR. I’m not sure exactly what’s going on, but that’s a different error (“Segmentation fault” vs. “Illegal instruction”). If you’re building from source, make sure you run python setup.py clean before you rebuild. Sometimes, only some files get rebuilt which can cause those sorts of crashes.
FWIW, I had the same problem as @elmarculino but I could solve it by installing my own MAGMA library.
Installing PyTorch with DEBUG=1 and running under gdb revealed that there was a problem with a MAGMA-related function (see https://pastebin.com/tpb28w7V ). So I removed the conda package magma-cuda92, installed MAGMA 7.3.0 from source, recompiled PyTorch, and it worked.
@colesbury
I’m getting the same error. Tried to do a clean install and it’s still happening.
Python version: 3.7.0
PyTorch version: ‘1.0.0a0+4b86a21’ (built from source)