Traceback (most recent call last):
  File "/home/shijinzhu/!work_python/python_pytorch/demo001 pytorch_test/test.py", line 73, in <module>
    output = net(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shijinzhu/!work_python/python_pytorch/demo001 pytorch_test/test.py", line 52, in forward
    x = F.relu(self.fc1(x))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/linear.py", line 54, in forward
    return self._backend.Linear()(input, self.weight, self.bias)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functions/linear.py", line 10, in forward
    output.addmm(0, 1, input, weight.t())
RuntimeError: cublas runtime error : library not initialized at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCGeneral.c:383
I think I’ve found the workaround. When we call .cuda(), we can specify the GPU device we want to load the data or model onto, to make sure they end up on the same GPU; for example, net.cuda(0) together with input.cuda(0).
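A minimal sketch of that workaround, assuming a toy nn.Linear model and the older-style .cuda(device_id) call; the names net and inp stand in for the model and input from the traceback, and the guards let the snippet skip itself when torch or CUDA is absent:

```python
# Sketch of pinning the model and the data to the same GPU via .cuda(device_id).
# Assumption: device 0 is the GPU we want; any valid index works as long as
# the model and the input use the SAME one.
try:
    import torch
    import torch.nn as nn

    if torch.cuda.is_available():
        device_id = 0                               # pick one GPU explicitly
        net = nn.Linear(16, 4).cuda(device_id)      # model on GPU 0
        inp = torch.randn(2, 16).cuda(device_id)    # input on the same GPU
        out = net(inp)                              # no cross-device mismatch
        print(tuple(out.shape))
    else:
        print("CUDA not available; skipping the GPU demo")
except ImportError:
    print("torch is not installed; skipping the GPU demo")
```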
I had also faced this issue, even on a single GPU.
I noticed that the cuBLAS samples required sudo permission to initialize.
To avoid needing root permission, I removed the cache files in the ~/.nv directory.
Hope this solution helps.
I’ve faced the same problem. On my server, though, it was caused by there not being enough memory on the GPU devices the program was using. You may point your program at another GPU by calling torch.cuda.set_device(id_of_idle_device).
Hope this can help you.
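A hedged sketch of that suggestion; the idle device index is assumed to be 1 here, and the snippet skips itself on machines without at least two GPUs:

```python
# Sketch: select a different GPU before any allocations happen, so that later
# .cuda() calls (with no explicit index) land on the chosen device.
try:
    import torch

    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        torch.cuda.set_device(1)               # assumption: GPU 1 is idle
        print(torch.cuda.current_device())     # now reports 1
    else:
        print("fewer than two GPUs; nothing to switch to")
except ImportError:
    print("torch is not installed; skipping")
```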
"sudo rm -r ~/.nv" works on my 4-GPU machine to remove the error below:
RuntimeError: cublas runtime error : library not initialized at /py/conda-bld/pytorch_1493681908901/work/torch/lib/THC/THCGeneral.c:394
FYI, it also solves the error below:
  File "/opt/anaconda/lib/python3.6/site-packages/torch/nn/functional.py", line 40, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
When trying with CUDA_VISIBLE_DEVICES=1, not using set_device throws the runtime error, and then trying torch.cuda.set_device(2) throws an invalid device ordinal error. I had thought torch was counting from 1.
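For what it’s worth, torch counts devices from 0, and CUDA_VISIBLE_DEVICES renumbers whatever it exposes. A small pure-Python illustration of the indexing (no GPU required):

```python
# With CUDA_VISIBLE_DEVICES=1, only physical GPU 1 is exposed, and the CUDA
# runtime renumbers it as logical device 0.  Valid ordinals are therefore
# 0 .. n_visible - 1, which is why set_device(2) raises an ordinal error.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"       # expose physical GPU 1 only

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
valid_ordinals = list(range(len(visible)))     # logical indices, zero-based
print(valid_ordinals)   # [0] -- so 2 is out of range, and so is 1
```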
I’m finetuning vgg19_bn on my own dataset, and I faced the same problem too.
Following the instructions above, I removed the ~/.nv directory with the command sudo rm -rf ~/.nv.
However, when I ran the GPU version of the CNN, the error showed up again, and I found that the ~/.nv directory had reappeared. Then I changed the batch_size from 64 to 32, and the code ran well. The same solution as @ShawnGuo above. Thanks a lot. By the way, when there is not enough memory for the code, should it raise this error? Could it give a more accurate error message? @smth.
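Since the error here turned out to be a memory problem surfacing as a cublas failure, one defensive pattern is to halve the batch size and retry when such a RuntimeError appears. A sketch under that assumption, where run_step is a hypothetical function that runs one training step at a given batch size:

```python
# Retry a training step with progressively smaller batch sizes when a
# CUDA/cublas RuntimeError (often a disguised out-of-memory) is raised.
def train_with_backoff(run_step, batch_size=64, min_batch=1):
    while batch_size >= min_batch:
        try:
            return run_step(batch_size)      # succeeds once the batch fits
        except RuntimeError as e:
            msg = str(e).lower()
            if "out of memory" in msg or "cublas" in msg:
                batch_size //= 2             # e.g. 64 -> 32, as in the post
            else:
                raise                        # unrelated errors propagate
    raise RuntimeError("batch size fell below the minimum")

# Hypothetical step that only fits in memory at batch_size <= 32:
def step(bs):
    if bs > 32:
        raise RuntimeError("cublas runtime error : library not initialized")
    return bs

print(train_with_backoff(step, batch_size=64))   # 32
```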