Segmentation fault (Core dump) when using GPU


(Elliothe) #1

Recently, IT support updated the GPU driver on our server to 390.12, and I then updated CUDA 9 and the cuDNN library to match. Since then, I get a segmentation fault as soon as I call .cuda(). The gdb stack trace is attached below. The script I am running is the official MNIST example:

gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/zhe/anaconda3/envs/pytorch_cuda9/bin/python3.6...done.
(gdb) r mnist.py 
Starting program: /home/zhe/anaconda3/envs/pytorch_cuda9/bin/python mnist.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /home/zhe/anaconda3/envs/pytorch_cuda9/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Detaching after fork from child process 34372.
Detaching after fork from child process 34373.
Detaching after fork from child process 34374.
[New Thread 0x7fffac683700 (LWP 34377)]
[New Thread 0x7fffa8a2a700 (LWP 34379)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa8a2a700 (LWP 34379)]
0x00007fffeac828d5 in ?? () from /usr/lib64/nvidia/libcuda.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 libuuid-2.23.2-43.el7_4.2.x86_64
(gdb) bt
#0  0x00007fffeac828d5 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#1  0x00007fffeadd2914 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffead6ee80 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007ffff7bc6e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff78f434d in clone () from /lib64/libc.so.6
(gdb) 


(Simon Wang) #2

Did you reinstall PyTorch with the CUDA 9 build?


(Elliothe) #3

Yes, I installed a local Anaconda 3 and then installed the CUDA 9 build of PyTorch.
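
For anyone comparing versions in a situation like this, here is a minimal sketch that prints the installed PyTorch version, the CUDA version that build was compiled against, and the driver version reported by nvidia-smi. The helper name `collect_versions` is hypothetical, and the sketch assumes nvidia-smi is on the PATH. It deliberately avoids calling .cuda(), so it should not trigger the crash itself.

```python
import shutil
import subprocess


def collect_versions():
    """Gather version info without moving anything onto the GPU.

    `collect_versions` is a hypothetical helper name used for illustration.
    """
    info = {"torch": None, "torch_cuda": None, "driver": None}

    # PyTorch version and the CUDA version the build was compiled against.
    try:
        import torch
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda
    except (ImportError, OSError):
        pass  # PyTorch missing or its shared libraries failed to load

    # Driver version as reported by nvidia-smi (assumed to be on PATH).
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version",
                 "--format=csv,noheader"],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,
                timeout=30,
            )
            if result.returncode == 0:
                lines = result.stdout.strip().splitlines()
                info["driver"] = lines[0].strip() if lines else None
        except (OSError, subprocess.SubprocessError):
            pass  # leave "driver" as None if the query fails

    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Run it with the same interpreter that crashes (here, the python from the pytorch_cuda9 environment), so the output describes the build that is actually being loaded.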