RuntimeError: CUDA error: unknown error

ojh3636 · August 8, 2018, 5:44am

I ran into a weird error when trying to move a Tensor to the GPU.
First, my machine specs:

Ubuntu 16.04
nvidia driver: 390.77
GPU: TitanX (Pascal)

Torch 0.4.1 (with cuda9.0)

I didn’t install any CUDA Toolkit locally; I just installed PyTorch via Anaconda into my virtualenv (I saw some posts saying that installing PyTorch via Anaconda automatically installs CUDA, so all I need to do is install a proper NVIDIA driver on my machine).

After installing PyTorch, I ran some simple code:

import torch
a = torch.rand(5, 3)
device = torch.device('cuda')
a.to(device)

Then I got the error in the title. (I also checked torch.cuda.is_available(), and it returned True.)

But when I run Python with sudo (like sudo ./~~/env/bin/python3), the simple code above works perfectly. Some posts said that once CUDA has been initialized under sudo, a normal user can run CUDA normally afterwards, but in my case that didn’t work. I need to run Python as sudo every time to get the code above to run.

Please help.

Jk749 · August 8, 2018, 6:14am

One problem I guess you might be having is with the CUDA installation.
When you installed the NVIDIA driver and the CUDA Toolkit, did you make the CUDA Toolkit path available in your environment variables?

Try setting LD_LIBRARY_PATH to /usr/local/cuda/lib64.
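
To see whether any directory the loader might search actually contains the CUDA runtime library, a small check like the one below may help. It is only a sketch: the function names are hypothetical, and it looks only at LD_LIBRARY_PATH plus whatever extra directories you pass in, not at every place the dynamic loader can search.

```python
import glob
import os


def ld_library_path_dirs():
    """Split LD_LIBRARY_PATH into its non-empty entries."""
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    return [d for d in raw.split(os.pathsep) if d]


def find_cudart(search_dirs):
    """Return {directory: [files]} for directories that contain libcudart.so*."""
    hits = {}
    for directory in search_dirs:
        pattern = os.path.join(glob.escape(directory), "libcudart.so*")
        matches = sorted(glob.glob(pattern))
        if matches:
            hits[directory] = matches
    return hits


if __name__ == "__main__":
    # /usr/local/cuda/lib64 is the path suggested above; add your own
    # directories (for example a conda environment's lib folder) as needed.
    dirs = ld_library_path_dirs() + ["/usr/local/cuda/lib64"]
    for directory, files in find_cudart(dirs).items():
        print(directory, files)
```

An empty result only means the library was not found in the listed directories; it does not by itself show that the CUDA runtime is missing.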

ojh3636 · August 8, 2018, 6:28am

I didn’t install the CUDA Toolkit in /usr/local. I just installed PyTorch in a conda virtualenv, and when I looked in “~/anaconda3/envs/myenv/lib”, all the CUDA libraries were there (e.g. libcudart.so*). I think activating my virtualenv should be enough, without setting LD_LIBRARY_PATH. Also, when I inspect GPU memory with nvidia-smi, I can see that memory is allocated every time I run the “a.to(device)” code (it is not freed, and I get an out-of-memory error after running this code about 15 times). So I think the CUDA libraries are loaded successfully, but some permission-related error occurs. I don’t know why this error occurs.

P.S.
I reinstalled the NVIDIA driver after writing this post, but it didn’t help.
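
Since the same code runs under sudo but not as a normal user, one thing that may be worth ruling out is whether the current user can open the NVIDIA device nodes. The sketch below assumes the Linux driver exposes them as /dev/nvidia*; the function name is hypothetical, and nothing in this thread confirms that device-node permissions were the actual cause.

```python
import glob
import os


def device_node_access(pattern="/dev/nvidia*"):
    """Map each path matching `pattern` to (readable, writable) for this user."""
    return {
        path: (os.access(path, os.R_OK), os.access(path, os.W_OK))
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    # Run once as the normal user and once with sudo, then compare the output.
    for path, (readable, writable) in device_node_access().items():
        print(f"{path}: read={readable} write={writable}")
```

If the normal user lacks read or write access on a node that root can open, that would at least be consistent with the sudo-only behaviour described above.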

ptrblck · August 8, 2018, 11:00am

Did you restart the machine after reinstalling the drivers?

If so, could you try to install PyTorch via conda with CUDA9.1 or 9.2?

ojh3636 · August 8, 2018, 3:17pm

Yes, I restarted after reinstalling the NVIDIA driver. (First I reinstalled 390.77, rebooted, and tried CUDA 9.0: it didn’t work. Then I reinstalled 384.xx, rebooted, and tried CUDA 9.0: that also didn’t work.) I can’t use the machine right now, so I’ll add more info after trying CUDA 9.2 tomorrow. (I don’t know why, but I can’t find pytorch-cuda9.1 in the conda repository.) Thank you for the reply.

ojh3636 · August 9, 2018, 3:19am

Yeah, I solved it! I just upgraded my NVIDIA driver to version 396 and used cuda92. Thank you so much. It works well! But I still don’t know what the problem was with the previous versions.
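
For anyone comparing driver and CUDA versions, a rough compatibility check could look like the sketch below. The minimum driver numbers are assumptions taken from memory of NVIDIA's CUDA release notes and should be verified there; the table and function names are hypothetical. Note that the original 390.77 driver with CUDA 9.0 passes this check, so it would not have explained the failure in this thread.

```python
# Minimum Linux driver versions per CUDA release. These values are assumptions
# recalled from NVIDIA's CUDA release notes; verify them before relying on them.
MIN_LINUX_DRIVER = {
    "9.0": (384, 81),
    "9.1": (390, 46),
    "9.2": (396, 26),
}


def parse_version(text):
    """Turn a dotted version string such as '390.77' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(driver_version, cuda_version):
    """Return True if the driver is at least the listed minimum for that CUDA."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_version]


if __name__ == "__main__":
    print(meets_minimum("390.77", "9.0"))  # True
    print(meets_minimum("390.77", "9.2"))  # False
```

Passing the check only means the driver is not below the listed minimum; as this thread shows, a nominally supported combination can still fail for other reasons.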

pinocchio · June 30, 2020, 8:45pm

What is the exact error message you are getting? Can you paste that please?