Cublas runtime error : library not initialized at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCGeneral.c:383

net = Net()
net = net.cuda()

input = Variable(torch.randn(1, 1, 32, 32))
input = input.cuda()
output = net(input)

Traceback (most recent call last):
  File "/home/shijinzhu/!work_python/python_pytorch/demo001 pytorch_test/test.py", line 73, in <module>
    output = net(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shijinzhu/!work_python/python_pytorch/demo001 pytorch_test/test.py", line 52, in forward
    x = F.relu(self.fc1(x))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/linear.py", line 54, in forward
    return self._backend.Linear()(input, self.weight, self.bias)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functions/linear.py", line 10, in forward
    output.addmm_(0, 1, input, weight.t())
RuntimeError: cublas runtime error : library not initialized at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCGeneral.c:383

Thank you

I am also getting the same error when I run my code on multiple GPUs, but the error is not consistent: sometimes I get it, sometimes not.

Is there any workaround to get rid of this problem? @smth


I am facing the same error. So is this related to machines that have multiple GPUs?

I think I've found a workaround. When we call .cuda(), we can specify the GPU device onto which we want to load the data or model, to make sure they end up on the same GPU. For example,

net = Net()
net = net.cuda(0)

input = Variable(torch.randn(1, 1, 32, 32))
input = input.cuda(0)
output = net(input)
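
Equivalently, a minimal sketch (assuming the same Net and imports as the snippet above, with device 0 picked as an example): keeping the device index in one variable makes it harder for the model and its inputs to drift onto different GPUs.

device_id = 0
net = Net().cuda(device_id)
input = Variable(torch.randn(1, 1, 32, 32)).cuda(device_id)
output = net(input)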

I had also faced this issue, even on a single GPU.
I noticed that the cuBLAS samples required sudo permission to initialize.
To avoid needing root permission, I removed the cache files in the ~/.nv directory instead.
Hope this solution helps.

I've faced the same problem, but on my server it was caused by there not being enough memory on the GPU devices the program was using. You may point your program at another GPU by using torch.cuda.set_device(id_of_idle_device).
Hope this can help you.
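
A minimal sketch of that workaround (the index 1 is just an example; use whichever GPU is idle on your machine):

import torch

torch.cuda.set_device(1)      # make the idle GPU the default CUDA device
x = torch.randn(4, 4).cuda()  # .cuda() with no argument now allocates there
print(x.get_device())         # prints 1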


I'm also facing the same issue. Although I removed the cache files in the ~/.nv directory, the same error is still raised when running my code.

@Rohith_AP, @ShawnGuo

"sudo rm -r ~/.nv" works on my 4-GPU machine to remove the error below:
RuntimeError: cublas runtime error : library not initialized at /py/conda-bld/pytorch_1493681908901/work/torch/lib/THC/THCGeneral.c:394

FYI, it also fixes the error below:
File "/opt/anaconda/lib/python3.6/site-packages/torch/nn/functional.py", line 40, in conv2d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

Thanks!


Hello,
I am having the same issue and the fix above does not work.
I think it is related to this: https://github.com/torch/cutorch/issues/677

When trying with CUDA_VISIBLE_DEVICES=1, not calling set_device throws the runtime error, and then trying torch.cuda.set_device(2) throws an ordinal error; I thought torch was counting from 1.

any help ?

CUDA_VISIBLE_DEVICES is 0-indexed. PyTorch is also 0-indexed.
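
To illustrate the remapping, a minimal sketch (assuming the process is launched with CUDA_VISIBLE_DEVICES=1, so only physical GPU 1 is exposed):

# launched as: CUDA_VISIBLE_DEVICES=1 python your_script.py
import torch

print(torch.cuda.device_count())  # 1: only one GPU is visible to the process
torch.cuda.set_device(0)          # valid: the visible GPU is ordinal 0
# torch.cuda.set_device(2)        # raises "invalid device ordinal"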

Yes, I also found out that we actually need to use torch.cuda.device(x) with x 0-indexed,
and not set_device, which seems not to work properly.
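
For reference, torch.cuda.device is a context manager, so a minimal sketch of using it (device 0 assumed) looks like:

import torch

with torch.cuda.device(0):
    x = torch.randn(2, 2).cuda()  # allocated on device 0 inside the block
print(x.get_device())             # prints 0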

@Rohith_AP, @ShawnGuo

"sudo rm -rf ~/.nv" works for me. This error had troubled me for a long time.
Thanks a lot!

Steven

I'm fine-tuning vgg19_bn on my own dataset, and I faced the same problem too.

Following the instructions above, I removed the ~/.nv directory with the command
sudo rm -rf ~/.nv

However, when I ran the GPU version of the CNN, the error showed up again, and I found that the ~/.nv directory had reappeared. I then changed the batch_size from 64 to 32 and the code ran well, the same solution as @ShawnGuo above. Thanks a lot. By the way, when there is not enough memory for the code, should it raise this error? Can it give a more accurate error message? @smth

Thanks! It worked for me.

Thanks, it works for me.
BTW, can you explain in detail why it works?

When I use PyTorch 0.3, it works, but when I use 0.4 compiled from master, my code throws this error, and removing ~/.nv doesn't work.


@tiantong, could I ask how you fixed this problem? It also happens to me after upgrading to PyTorch v0.4. Thanks!


What's the source of this problem?

Is there not a way to set these indices globally, once, for everything?

I think:

export CUDA_VISIBLE_DEVICES=$i

is what I'm looking for.
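
If you prefer to set it from Python instead of the shell, a minimal sketch (this must run before anything initializes CUDA, i.e. before the first .cuda() call):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 1: only GPU 0 is visible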