RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:27

Hi guys,

I have a GeForce GTX TITAN X GPU with Python 3.7.4 and CUDA 10.1. I am training my GAN on this GPU, and the following error occasionally occurs during training. It is always related to spectral norm, which is provided by torch.nn.utils.spectral_norm.

In this example it happens at iteration 180, and another failure (not listed here) happened at iteration 5800. Is that a bug?

  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/HDD2/hwan7885/proj_hr/gan_ST_map_cnn/sagan/train_cifar10.py", line 15, in <module>
    trainer.train()
  File "/mnt/HDD2/hwan7885/proj_hr/gan_ST_map_cnn/sagan/trainer.py", line 261, in train
    fake_images = self.G(z, real_labels)
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/HDD2/hwan7885/proj_hr/gan_ST_map_cnn/sagan/sagan_models.py", line 188, in forward
    act4 = self.block4(act3, labels)    # n x g_conv_dim*2 x 64 x 64
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/HDD2/hwan7885/proj_hr/gan_ST_map_cnn/sagan/sagan_models.py", line 151, in forward
    x0 = self.snconv2d0(x0)
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 714, in _call_impl
    result = hook(self, input)
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 99, in __call__
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
  File "/mnt/HDD2/hwan7885/anaconda3/envs/Nepean/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 85, in compute_weight
    sigma = torch.dot(u, torch.mv(weight_mat, v))
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:27

Could you check whether you are running out of memory?
If that’s not the case, could you rerun the script via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here, please?
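For the memory check, a rough sketch of what could be logged from inside the training loop is below. The helper name `gpu_memory_report` and the device index `0` are assumptions for illustration only; the `torch.cuda` calls themselves are standard PyTorch functions. The helper returns `None` when PyTorch or a CUDA device is not available, so it is safe to leave in the script.

```python
def gpu_memory_report(device=0):
    """Return current GPU memory figures in MiB, or None without CUDA.

    Hypothetical helper, not part of PyTorch.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    mib = 2 ** 20
    return {
        # Memory currently occupied by tensors.
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # Memory held by the caching allocator (tensors plus cached blocks).
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        # Peak tensor memory since the start of the program.
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
        # Total memory of the device.
        "total_mib": torch.cuda.get_device_properties(device).total_memory / mib,
    }


report = gpu_memory_report()
if report is not None:
    print(report)
```

Calling this every few hundred iterations and comparing `peak_allocated_mib` with `total_mib` should show whether the run gets close to the device limit before it fails.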

Sorry for the late reply, but it seems like the error does not happen if I add CUDA_LAUNCH_BLOCKING=1. Why is that?

It could be a race condition or just a coincidence, since you said the issue was non-deterministic.
Could you update your PyTorch installation to use CUDA 10.2 and rerun the script?
If you are still seeing this issue (and have a local CUDA installation), you could run a single training iteration (forward and backward) with cuda-memcheck --tool racecheck to check the code for potential races.
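A single forward and backward pass through a spectral-normalized layer could look roughly like the sketch below. This is only a stand-in: the small `Conv2d` layer, the input shape, and the function name `run_single_iteration` are assumptions, not the SAGAN generator from the traceback. In practice the real model and one real batch should be used instead.

```python
def run_single_iteration(seed=0):
    """Run one forward/backward pass through a spectral-normalized conv layer.

    Hypothetical stand-in for one training iteration of the real model.
    """
    # Imported here so the file can be loaded even where PyTorch is missing.
    import torch
    import torch.nn as nn

    torch.manual_seed(seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # spectral_norm registers a forward pre-hook that runs the power
    # iteration (the torch.mv / torch.dot calls seen in the traceback).
    layer = nn.utils.spectral_norm(nn.Conv2d(3, 8, kernel_size=3, padding=1))
    layer = layer.to(device)
    layer.train()  # power iteration only runs in training mode

    x = torch.randn(4, 3, 16, 16, device=device)
    loss = layer(x).mean()
    loss.backward()

    if device.type == "cuda":
        # Wait for all queued kernels so any asynchronous error surfaces here.
        torch.cuda.synchronize()
    return loss.item()
```

Saved to a file (say `single_iter.py`, a made-up name) with a call to `run_single_iteration()` at the bottom, this could then be launched under cuda-memcheck --tool racecheck in the same way as the full training script.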