Cublas runtime error for torch.bmm on RTX 2080Ti

Tal_Mezheritsky · July 10, 2019, 3:23pm

Hi, I am trying to run a script from GitHub: https://github.com/vsitzmann/deepvoxels. I have an RTX 2060 GPU on my laptop and an RTX 2080Ti GPU on a server. I use the exact same conda environment on both machines, and both have CUDA version 10.1. I am able to run the code without a problem on my laptop; however, I can’t make it work on the server.

Initially, the code didn’t work on either machine. Simply updating pytorch on my laptop solved the problem there. On the server I ran this command: conda install pytorch torchvision cudatoolkit=10.0 -c pytorch. It updated pytorch and torchvision as asked, but the same problem remains. Here is the output:

Begin training...
Traceback (most recent call last):
  File "run_deepvoxels.py", line 395, in <module>
    main()
  File "run_deepvoxels.py", line 386, in main
    train()
  File "run_deepvoxels.py", line 183, in train
    grid2world=grid_origin)
  File "/store/usagers/tamez/deepvoxels/projection.py", line 90, in comp_lifting_idcs
    voxel_bounds_min, voxel_bounds_max, _ = self.compute_frustum_bounds(camera_to_world, world2grid)
  File "/store/usagers/tamez/deepvoxels/projection.py", line 73, in compute_frustum_bounds
    p = torch.bmm(camera_to_world.repeat(8, 1, 1), corner_points)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:450

This is my first post here so if I did not provide some information please let me know.

Thanks in advance.

Tal_Mezheritsky · July 11, 2019, 2:20pm

Even after reinstalling pytorch multiple times through conda, I could not make the script work.

I identified the problem by checking torch.version.cuda in the Python console on the server and realising that the imported torch was built for CUDA 9 instead of CUDA 10. Checking torch.__file__ then showed that Python was not even using the torch from my conda environment; it was picking up another torch install on the server. Once I got rid of the other pytorch install, torch.version.cuda showed CUDA 10, and everything works now.
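
For anyone hitting the same error, here is a rough sketch of the checks described above, collected into one script. The helper name describe_torch_install and the "inside_env" comparison against sys.prefix are my own illustrative choices, not part of PyTorch; the script only relies on torch.__file__, torch.__version__, torch.version.cuda and torch.cuda.is_available().

```python
import importlib.util
import sys


def describe_torch_install():
    """Report which torch package this interpreter would import and its CUDA build.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    info = {"python": sys.executable, "env_prefix": sys.prefix}

    # Look torch up on the import path first, so a missing install is reported
    # instead of raising ImportError.
    if importlib.util.find_spec("torch") is None:
        info["torch_file"] = None
        return info

    import torch

    info["torch_file"] = torch.__file__          # where the imported package lives
    info["torch_version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda      # CUDA version torch was built with; None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    # Assumption: a torch that belongs to the active environment sits under sys.prefix.
    info["inside_env"] = torch.__file__.startswith(sys.prefix)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

Run it with the conda environment activated. If torch_file points outside the environment (for example to a system-wide or user-level site-packages directory), or cuda_build is not the version you installed, then a different torch install is shadowing the one from conda, which is what happened in my case.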