Unstable cuBLAS error with Nvidia GeForce RTX 2080 Ti

ishalyminov · April 21, 2019, 10:13pm

Hi,

I’m performing this batch matrix multiplication:

emb = torch.bmm(a, b)

where

(Pdb) a.shape
torch.Size([32, 1, 5])
(Pdb) b.shape
torch.Size([32, 5, 128])

I get the following:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/learning_to_rank_kvmemnn.py", line 143, in <module>
    train(model, (xs, mems, ys, cands), word_vocab, **CONFIG)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/learning_to_rank_kvmemnn.py", line 68, in train
    Variable(long_tensor_t(cands)))
  File "/home/ishalyminov/.virtualenvs/alana_learning_to_rank3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/kvmemnn.py", line 65, in forward
    emb = torch.bmm(a, b)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:441

When I repeat the exact same operation at the same place in pdb, it succeeds. Moreover, the error only reproduces on an Nvidia GeForce RTX 2080 Ti card; it does not occur on a K20.

My torch version is torch==1.0.1.post2
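
For anyone who wants to try this outside my code, here is a minimal sketch with random tensors of the same shapes. The random values, the function name run_bmm, and the CPU fallback are assumptions for illustration and are not taken from the failing run. CUDA errors are reported asynchronously, so the sketch calls torch.cuda.synchronize() to make any failure surface at the bmm call rather than at some later line.

```python
import torch


def run_bmm():
    # Use the GPU when one is visible; otherwise fall back to CPU so the
    # sketch still runs (the original failure was GPU-only).
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Same shapes as in the pdb session: (32, 1, 5) x (32, 5, 128).
    a = torch.randn(32, 1, 5, device=device)
    b = torch.randn(32, 5, 128, device=device)

    emb = torch.bmm(a, b)

    # Wait for pending GPU work so an asynchronous CUDA/cuBLAS error
    # is raised here instead of at an unrelated later call.
    if device.type == "cuda":
        torch.cuda.synchronize()
    return emb


emb = run_bmm()
print(emb.shape)
```

The result has one row per batch element, i.e. shape (32, 1, 128). Whether this standalone version triggers the same error on a given machine is not something I can promise.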

ishalyminov · April 29, 2019, 6:51pm

It looks like the problem went away after I reinstalled PyTorch following the installation guide on the official site.
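
A quick way to see which build ended up installed is sketched below. The helper name describe_torch_build is hypothetical; the attributes it reads are standard PyTorch, but what they report depends on the machine. One possible explanation for the original error, not confirmed in this thread, is a mismatch between the CUDA version the binary was built with and what the card supports.

```python
import torch


def describe_torch_build():
    # Collect the installed PyTorch version and the CUDA toolkit version
    # the binary was built against (None for CPU-only builds).
    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name and compute capability of the first visible GPU.
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


info = describe_torch_build()
for key, value in info.items():
    print(key, value)
```

Comparing this output before and after a reinstall shows whether the CUDA build actually changed.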