GPU: no CUDA-capable device is detected

Greetings pytorchers.

I am trying to run pytorch on a machine with eight Tesla K10.G1.8GB GPUs (confirmed by nvidia-smi).
CUDA is installed correctly and the samples from nvidia run (as do my own .cu tests).
A simple example gives me a runtime error that I am finding difficult to comprehend.

Pointers on how to debug this problem (details are below) would be awesome.

Thank you. Here goes…

uname -a:
Linux 4398392dfc97 4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

torch.__version__:
0.5.0a0+ef477b2 (compiled from source following instructions on github)

Command (I see the -DWITH_CUDA compiler flag flash by in gcc):
CUDA_HOME=/usr/local/cuda python setup.py install

Additional (in the pytorch directory):
grep -R "-DWITH_CUDA" ./*:
./setup.py: extra_compile_args += ['-DWITH_CUDA']
./tools/cpp_build/libtorch/CMakeLists.txt: add_definitions(-DWITH_CUDA)
./torch/lib/THD/CMakeLists.txt: ADD_DEFINITIONS(-DWITH_CUDA=1)
./torch/lib/build/THD/CMakeFiles/THD.dir/flags.make:CXX_DEFINES = -DWITH_CUDA=1 -DWITH_GLOO=1 -D_THD_CORE=1

nvcc --version:
NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu

Code (test.py):
import torch
n_devices = torch.cuda.device_count()
a = torch.cuda.FloatTensor([1.])

Output:
THCudaCheck FAIL file=/home/tyoung/usr/src/pytorch/aten/src/THC/THCGeneral.cpp line=71 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "cuda.py", line 4, in <module>
a = torch.cuda.FloatTensor([1.])
File "/home/tyoung/usr/bin/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /home/tyoung/usr/src/pytorch/aten/src/THC/THCGeneral.cpp:71

Note, if I do the same thing on a different machine with one GeForce GTX 680 and run
CUDA_VISIBLE_DEVICES=1 python test.py
i.e., point it to a nonexistent device, I get the same output as above. Naturally, there,
CUDA_VISIBLE_DEVICES=0 python test.py
runs the code correctly.
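For completeness, these are the kinds of sanity checks I have been running before blaming (py)torch (a sketch; it assumes the default /usr/local/cuda install prefix, so adjust paths for your system):

```shell
# 1. Is the environment masking devices? An empty string or an
#    out-of-range index hides every GPU from the CUDA runtime.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"

# 2. Does the driver enumerate the cards at all?
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L \
    || echo "nvidia-smi not found on PATH"

# 3. Is the toolkit symlink pointing somewhere real?
ls -ld /usr/local/cuda* 2>/dev/null || echo "no /usr/local/cuda found"
```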

How (or why?) are my Tesla K10s hiding from (py)torch? :slight_smile:

Best,
Toby

While building from source, did you get any information about the CUDA version that was found, its location, etc.?
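If the original log is gone, one way to recover that information is to rebuild while capturing the output and then search it (a sketch; the exact wording of the CUDA summary lines varies between PyTorch versions):

```shell
# Rebuild, teeing stdout+stderr into a log we can search afterwards
CUDA_HOME=/usr/local/cuda python setup.py install 2>&1 | tee build.log

# Then look for the CUDA detection lines
grep -i cuda build.log || echo "no CUDA mentions -- build may have skipped CUDA"
```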

Thank you for the response. :slight_smile:
I will have another try tomorrow and post any configure/compile indicators I find.
Best, T

Good pointer @ptrblck ! Thanks, that helped me to solve the issue.

After searching through the configuration scripts this morning, I found some dodgy-looking links to the CUDA libraries. After a careful, simple, fresh install from scratch, pytorch works just great and as expected.
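For anyone hitting the same error later, two quick checks that would have exposed my stale links sooner (a sketch; the paths assume a default Linux CUDA install, and ldconfig often lives in /sbin rather than on a user PATH):

```shell
# Which CUDA runtime libraries does the dynamic loader actually know about?
{ ldconfig -p 2>/dev/null || /sbin/ldconfig -p 2>/dev/null; } \
    | grep -E 'libcuda(rt)?\.so' \
    || echo "no CUDA libraries registered with the loader"

# Does the toolkit symlink resolve to a real directory?
readlink -f /usr/local/cuda || echo "/usr/local/cuda does not resolve"
```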

Best,
T