Hi,
So I’m performing this batch matrix multiplication:
emb = torch.bmm(a, b)
where
(Pdb) a.shape
torch.Size([32, 1, 5])
(Pdb) b.shape
torch.Size([32, 5, 128])
I get the following:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/learning_to_rank_kvmemnn.py", line 143, in <module>
    train(model, (xs, mems, ys, cands), word_vocab, **CONFIG)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/learning_to_rank_kvmemnn.py", line 68, in train
    Variable(long_tensor_t(cands)))
  File "/home/ishalyminov/.virtualenvs/alana_learning_to_rank3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ishalyminov/data/alana_learning_to_rank3/alana_learning_to_rank/kvmemnn.py", line 65, in forward
    emb = torch.bmm(a, b)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:441
When I repeat this exact operation at the same place in pdb, it succeeds. Moreover, the error only reproduces on an NVIDIA RTX 2080i card; it does not occur on a K20.
My torch version is torch==1.0.1.post2
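
For anyone who wants to try this outside my code base, a standalone check along the following lines might help narrow it down. It is only a sketch under assumptions: the random tensors are stand-ins for my real `a` and `b` (only the shapes match), `repro` is a hypothetical helper name, and the script falls back to the CPU when no GPU is visible. Since CUDA calls run asynchronously, it synchronizes after `torch.bmm` so that a pending GPU error, if there is one, should be raised at that point rather than later.

```python
import torch


def repro(device):
    # Random stand-ins for the real tensors; only the shapes match the post.
    a = torch.randn(32, 1, 5, device=device)
    b = torch.randn(32, 5, 128, device=device)
    emb = torch.bmm(a, b)
    if device.type == "cuda":
        # CUDA kernels are launched asynchronously; synchronizing here
        # forces any pending GPU error to be reported at this point.
        torch.cuda.synchronize()
    return emb


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Environment details that are useful to include in a report like this one.
print("torch:", torch.__version__)
print("CUDA version torch was built with:", torch.version.cuda)
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))

emb = repro(device)
print(emb.shape)
```

If this fails on the 2080 but passes on the K20, the problem is probably not specific to my model code, and the printed CUDA build version would be worth comparing between the two machines.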