Segmentation fault (Core dump) when using GPU

Recently, IT support updated the GPU driver on the server to 390.12, and I updated CUDA 9 and the cuDNN library to match. Since then, I get a segmentation fault as soon as I call the .cuda() function. The gdb stack trace is attached below. The example I am running is the official MNIST example:

gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/zhe/anaconda3/envs/pytorch_cuda9/bin/python3.6...done.
(gdb) r mnist.py 
Starting program: /home/zhe/anaconda3/envs/pytorch_cuda9/bin/python mnist.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /home/zhe/anaconda3/envs/pytorch_cuda9/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Detaching after fork from child process 34372.
Detaching after fork from child process 34373.
Detaching after fork from child process 34374.
[New Thread 0x7fffac683700 (LWP 34377)]
[New Thread 0x7fffa8a2a700 (LWP 34379)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa8a2a700 (LWP 34379)]
0x00007fffeac828d5 in ?? () from /usr/lib64/nvidia/libcuda.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 libuuid-2.23.2-43.el7_4.2.x86_64
(gdb) bt
#0  0x00007fffeac828d5 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#1  0x00007fffeadd2914 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffead6ee80 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007ffff7bc6e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff78f434d in clone () from /lib64/libc.so.6
(gdb) 
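
The backtrace above only shows unresolved frames inside libcuda.so.1, so it does not say which Python call triggered the crash. A minimal sketch for narrowing this down without the full MNIST script, assuming Python 3.3+ (for the standard-library faulthandler module); the helper name try_cuda is hypothetical and only used for illustration:

```python
import faulthandler

# On SIGSEGV, dump the Python-level stack of every thread to stderr,
# which complements the native frames that gdb shows.
faulthandler.enable()


def try_cuda():
    """Hypothetical helper: make one small .cuda() call and report the outcome."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no usable CUDA device"
    # The call reported to crash in this thread.
    torch.zeros(1).cuda()
    return "ok"


print(try_cuda())
```

If the process still dies here, the crash is independent of the MNIST code and points at the driver, the CUDA installation, or the GPU itself.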

Did you reinstall PyTorch with the CUDA 9 version?

Yes, I installed a local Anaconda 3 and then installed the CUDA 9 version of PyTorch.

Did you manage to solve this issue?

I get exactly the same error when I try to use the GPU (Tesla K80) on Scientific Linux, with Anaconda 3, CUDA 9.1, and NVIDIA driver version 390.30.

EDIT: Apparently, for me it is a GPU-related issue. When I ran nvidia-smi, I saw that some of my GPUs report “Volatile Uncorr. ECC” errors. When I set CUDA_VISIBLE_DEVICES so that only the other GPUs are visible, I can run the code on the GPU. The issue should be resolved once I reset the GPUs and reboot.
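
The workaround above can be sketched in Python. This is an illustration under assumptions, not a tested fix: it assumes nvidia-smi accepts the query field ecc.errors.uncorrected.volatile.total (check nvidia-smi --help-query-gpu on your driver), and the function names are hypothetical. GPUs that report a non-numeric value such as "[N/A]" are assumed to be usable.

```python
import os
import subprocess

# Assumed query fields; verify with `nvidia-smi --help-query-gpu`.
QUERY = "index,ecc.errors.uncorrected.volatile.total"


def read_ecc_counts():
    """Return nvidia-smi CSV output, one 'index, count' line per GPU."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
        universal_newlines=True,
    )


def healthy_gpu_indices(csv_text):
    """Return indices of GPUs without volatile uncorrected ECC errors."""
    healthy = []
    for line in csv_text.strip().splitlines():
        index, count = [field.strip() for field in line.split(",", 1)]
        try:
            errors = int(count)
        except ValueError:
            errors = 0  # e.g. "[N/A]": ECC not reported, assumed usable
        if errors == 0:
            healthy.append(index)
    return healthy


def restrict_to_healthy_gpus(csv_text):
    """Set CUDA_VISIBLE_DEVICES; must run before the first CUDA call."""
    # Make CUDA number devices in the same PCI bus order as nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    visible = ",".join(healthy_gpu_indices(csv_text))
    os.environ["CUDA_VISIBLE_DEVICES"] = visible
    return visible
```

Calling restrict_to_healthy_gpus(read_ecc_counts()) at the very top of the script, before anything initializes CUDA, hides the affected GPUs from the process. The same effect can be had by exporting CUDA_VISIBLE_DEVICES in the shell before starting Python.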
