Installation issue

ironmanaudi_leon · February 10, 2020, 9:47am

After installing PyTorch, I found that I could not run my code on the GPU.

CUDA 10.0.130 is already installed and the GPU driver is functioning, but cuDNN is not installed. The server is a DGX Station and the PyTorch version is 1.4.0.

ironmanaudi_leon · February 10, 2020, 10:35am

Environment

PyTorch version: 1.4.0
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu18.04
GCC version: (GCC) 7.3.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-DGXS
GPU 1: Tesla V100-DGXS
GPU 2: Tesla V100-DGXS
GPU 3: Tesla V100-DGXS

Nvidia driver version: 410.79
cuDNN version: none

ironmanaudi_leon · February 10, 2020, 10:44am

Test

When I run the following code in test.py:

import torch
print(torch.cuda.is_available())

it reports: Segmentation fault (core dumped).
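A segfault like this kills the interpreter before Python can raise an exception, so it can help to run the check in a child process and inspect how it exited. The following is a minimal sketch, not part of the original report: `run_probe` and `CUDA_CHECK` are hypothetical names, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code. The `-X faulthandler` option asks the child to dump a Python-level traceback to stderr on a fatal signal.

```python
import signal
import subprocess
import sys


def run_probe(code, timeout=60):
    """Run `code` in a fresh Python interpreter and report how it exited.

    Returns (returncode, stdout, signal_name). signal_name is None unless the
    child was killed by a signal (POSIX reports that as a negative returncode).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    signal_name = None
    if proc.returncode < 0:
        signal_name = signal.Signals(-proc.returncode).name
    return proc.returncode, proc.stdout.strip(), signal_name


# The same check as test.py, run in isolation (assumes torch is installed).
CUDA_CHECK = "import torch; print(torch.cuda.is_available())"

if __name__ == "__main__":
    returncode, stdout, signal_name = run_probe(CUDA_CHECK)
    if signal_name is not None:
        print(f"child was killed by {signal_name}")
    else:
        print(f"exit code {returncode}, output: {stdout!r}")
```

If the child dies with SIGSEGV, the parent survives and can log the failure, which makes the problem easier to report or to retry after a driver reset.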

The stack trace is attached here:

(base) ykzhang@qdgx1-DGX-Station:~$ gdb python
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) r test.py 
Starting program: /home/ykzhang/anaconda3/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007fffa16691bb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007fffa16691bb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fffa16346e5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffa16362ac in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffa167f42d in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffb6679a55 in ?? () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#5  0x00007fffb6679ab1 in ?? () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#6  0x00007ffff7bc5827 in __pthread_once_slow (once_control=0x7fffb68d0af8, init_routine=0x7fffb6679aa0) at pthread_once.c:116
#7  0x00007fffb66b4ec9 in ?? () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#8  0x00007fffb6674a3a in ?? () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#9  0x00007fffb667996b in ?? () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#10 0x00007fffb66a197a in cudaGetDeviceCount () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#11 0x00007fffe818a19e in THCPModule_isDriverSufficient(_object*, _object*) () from /home/ykzhang/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#12 0x00005555556b4ff1 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1578510683607/work/Objects/call.c:633
#13 0x00005555556b5231 in _PyCFunction_FastCallKeywords (func=0x7ffff47b65a0, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /tmp/build/80754af9/python_1578510683607/work/Objects/call.c:734
#14 0x0000555555719a5d in call_function (kwnames=0x0, oparg=0, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:4568
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:3093
#16 0x00005555556b468b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1578510683607/work/Objects/call.c:283
#17 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1578510683607/work/Objects/call.c:408
#18 0x00005555557196c9 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:4616
#19 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:3093
#20 0x000055555566e6f9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:3930
#21 0x000055555566f5f4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:3959
#22 0x000055555566f61c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /tmp/build/80754af9/python_1578510683607/work/Python/ceval.c:524
#23 0x0000555555770974 in run_mod () at /tmp/build/80754af9/python_1578510683607/work/Python/pythonrun.c:1035
#24 0x000055555577acf1 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1578510683607/work/Python/pythonrun.c:988
#25 0x000055555577aee3 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1578510683607/work/Python/pythonrun.c:429
#26 0x000055555577bf95 in pymain_run_file (p_cf=0x7fffffffdf10, filename=0x5555558b2850 L"test.py", fp=0x5555558fb4a0) at /tmp/build/80754af9/python_1578510683607/work/Modules/main.c:434
#27 pymain_run_filename (cf=0x7fffffffdf10, pymain=0x7fffffffe020) at /tmp/build/80754af9/python_1578510683607/work/Modules/main.c:1613
#28 pymain_run_python (pymain=0x7fffffffe020) at /tmp/build/80754af9/python_1578510683607/work/Modules/main.c:2874
#29 pymain_main () at /tmp/build/80754af9/python_1578510683607/work/Modules/main.c:3414
#30 0x000055555577c0bc in _Py_UnixMain () at /tmp/build/80754af9/python_1578510683607/work/Modules/main.c:3449
#31 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556500a0 <main>, argc=2, argv=0x7fffffffe178, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe168) at ../csu/libc-start.c:310
#32 0x0000555555724990 in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb)

albanD · February 10, 2020, 4:29pm

Interesting.
Can you run the CUDA samples properly? It seems like the segfault comes from inside the CUDA library itself…
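The backtrace shows the crash inside `cuInit`, reached through `cudaGetDeviceCount`. One way to check the driver without PyTorch or the compiled samples is to call the driver API directly through `ctypes`. This is only a sketch under assumptions: `probe_cuda_driver` is a hypothetical helper, and the library name `libcuda.so.1` is taken from the path in the backtrace. If the driver is in a bad state, this call may itself segfault, which would at least confirm that PyTorch is not the cause.

```python
import ctypes


def probe_cuda_driver(library="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount directly through the driver library.

    Returns None if the library cannot be loaded, otherwise a dict with the
    cuInit status code (0 means CUDA_SUCCESS) and the device count, which is
    None when either call fails.
    """
    try:
        driver = ctypes.CDLL(library)
    except OSError:
        return None

    # Declare the C signatures: CUresult cuInit(unsigned int flags)
    # and CUresult cuDeviceGetCount(int *count).
    driver.cuInit.argtypes = [ctypes.c_uint]
    driver.cuInit.restype = ctypes.c_int
    driver.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    driver.cuDeviceGetCount.restype = ctypes.c_int

    status = driver.cuInit(0)
    count = None
    if status == 0:
        value = ctypes.c_int(0)
        if driver.cuDeviceGetCount(ctypes.byref(value)) == 0:
            count = value.value
    return {"cuInit": status, "device_count": count}


if __name__ == "__main__":
    print(probe_cuda_driver())
```

On a healthy machine with the configuration listed earlier in the thread, one would expect a `cuInit` status of 0 and a device count matching the number of GPUs; a non-zero status or a crash points at the driver rather than at PyTorch.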

ironmanaudi_leon · February 11, 2020, 2:29am

Hi Alban, thanks for the quick response.
I don't have permission to run the samples myself, but I will try to contact someone who does, and I'll let you know as soon as the results come out.

ironmanaudi_leon · February 11, 2020, 3:40am

There is only one sample I can run, and it suggests that the problem is not with CUDA.

ykzhang@qdgx1-DGX-Station:/usr/local/cuda/samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-DGXS-32GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11652.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12856.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			725258.7

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

ironmanaudi_leon · February 11, 2020, 3:46am

After a reboot, everything’s fine…

albanD · February 11, 2020, 2:31pm

It must have been a bad state on the CUDA side. The environment script runs cuda.is_available() as well, and it seems to have run properly.
Happy this is solved.