torch.bmm is throwing a cublas runtime error

I am trying to use torch.bmm with PyTorch 0.4. I have an old project written for PyTorch 0.4 and need to insert a code snippet into it for some testing. I took this snippet from another project implemented in a newer version of PyTorch. When I try to run this code, I get the error below.
I have also tried torch.matmul, but I got the same result.

ts3 = torch.bmm(ts1, ts2)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/

Is this a PyTorch version issue? If it is, is there an alternative function for the older version? If not, how can I implement this function on my own?

Here is the code snippet I am trying to add:

def call(self, batch_kernel):
    B, H, W = batch_kernel.size()  # [B, l, l]
    ts1 = batch_kernel.view((B, 1, H * W))
    ts2 = self.weight.expand((B, ) + self.size)
    ts3 = torch.matmul(ts1, ts2)
    return ts3.view((B, -1))

Could you post the CUDA version you are using for this older PyTorch installation as well as the shapes of ts1 and ts2?
This would allow us to check whether this is still an issue or was already fixed.
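A minimal sketch of how the requested details could be collected in one place. The helper name `describe_env_and_shapes` and the stand-in tensors are hypothetical and not part of the original project. Note that `torch.version.cuda` reports the CUDA version the PyTorch binary was built with, which can differ from what a locally installed `nvcc` prints.

```python
import torch


def describe_env_and_shapes(ts1, ts2):
    """Return the PyTorch/CUDA build info and the operand details as a dict."""
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        # (shape, dtype, is_contiguous) for each operand
        "ts1": (tuple(ts1.size()), ts1.dtype, ts1.is_contiguous()),
        "ts2": (tuple(ts2.size()), ts2.dtype, ts2.is_contiguous()),
    }


# Stand-in tensors (assumption): replace with the real ts1 and ts2.
ts1 = torch.randn(16, 1, 441)
ts2 = torch.randn(441, 10).expand(16, 441, 10)

info = describe_env_and_shapes(ts1, ts2)
for key, value in info.items():
    print(key, value)
```

Calling the helper right before the failing torch.bmm line, with the real tensors, would give the shapes exactly as the GPU call sees them.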

Cuda compilation tools, release 8.0, V8.0.61
ts1 size: torch.Size([16, 1, 441])
ts2 Size: torch.Size([16, 441, 10])
I have also tried this snippet instead of torch.bmm and torch.matmul, but got the same error:

    B, H, W = batch_kernel.size()
    ts1 = batch_kernel.view((B, 1, H * W))
    ts2 = self.weight.expand((B, ) + self.size)
    s1, s2, s3 = ts2.size()
    # ts3 = torch.bmm(ts1, ts2)
    out = ts1.new_empty(B, s3)  # same dtype and device as ts1
    for i, batch_v in enumerate(ts1):
        # batch_v: [1, H * W], ts2[i]: [H * W, s3]
        out[i] = (batch_v @ ts2[i]).view(-1)
    return out.view((B, -1))

However, I tried the same project on PyTorch >= 1.0. It works fine with bmm there, but I end up with worse results on the newer version.

Is torch.matmul also raising an error in 1.5?
How much difference do you see between 0.4 and 1.5 and how did you create the baseline, if this error is raised?

No, it works fine in v1.0 and later.
I am working on the RCAN project (image super-resolution). Its original code is written for v0.4. On a test dataset it gives PSNR=37.xx, but on v1.0 and v1.1 it gives PSNR=33.xx. I don't have much time to resolve that issue, so I went back to the older version, but when I add the kernel code to that project, the problem above arises.

If you really need to get the matmul working in 0.4, you could implement it with a for loop, which might cost some performance but should at least work functionally.
Alternatively, assuming you installed 0.4 from the binaries, you could build 0.4 from source against the current CUDA and cuBLAS versions (not sure if that would work).
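A rough sketch of the for-loop idea, plus a second variant that replaces the matrix product with a broadcast multiply and a sum. The function names `bmm_loop` and `bmm_elementwise` are made up for illustration, and the tensors are random stand-ins with the shapes posted above. One caveat: on the GPU, `torch.mm` most likely still goes through cuBLAS, so the loop version may hit the same error. The elementwise version should avoid the GEMM call, as far as I know, at the cost of a temporary of shape [B, M, K, N].

```python
import torch


def bmm_loop(a, b):
    """Batched matmul as one 2-D torch.mm per batch element.

    a: [B, M, K], b: [B, K, N] -> [B, M, N]
    """
    return torch.stack([torch.mm(a[i], b[i]) for i in range(a.size(0))], dim=0)


def bmm_elementwise(a, b):
    """Same result via broadcast multiply and a sum over K.

    a.unsqueeze(3): [B, M, K, 1], b.unsqueeze(1): [B, 1, K, N]
    """
    return (a.unsqueeze(3) * b.unsqueeze(1)).sum(dim=2)


# Stand-in tensors (assumption) with the shapes from this thread.
ts1 = torch.randn(16, 1, 441)
ts2 = torch.randn(441, 10).expand(16, 441, 10)

out_loop = bmm_loop(ts1, ts2)
out_elem = bmm_elementwise(ts1, ts2)
print(out_loop.size(), out_elem.size())
```

Either result can be flattened with `.view((B, -1))` in the same way as ts3 in the original snippet. Checking both against torch.bmm on the CPU first would confirm the replacement is numerically equivalent before moving it to the GPU.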