torch.bmm is throwing a cuBLAS runtime error

Cuda compilation tools, release 8.0, V8.0.61
ts1 size: torch.Size([16, 1, 441])
ts2 size: torch.Size([16, 441, 10])
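
For reference, here is a minimal standalone snippet that reproduces the failing call with the same shapes (random placeholder values; the tensors are on the GPU on my setup, since the error comes from cuBLAS):

    import torch

    ts1 = torch.randn(16, 1, 441).cuda()    # (B, 1, H*W)
    ts2 = torch.randn(16, 441, 10).cuda()   # (B, H*W, out_features)
    ts3 = torch.bmm(ts1, ts2)               # this call raises the cuBLAS runtime error on my setup
    print(ts3.size())                       # expected: torch.Size([16, 1, 10])
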
I have also tried the snippet below instead of torch.bmm and torch.matmul, but I get the same error:

    B, H, W = batch_kernel.size()
    ts1 = batch_kernel.view((B, 1, H * W))        # (B, 1, H*W)
    ts2 = self.weight.expand((B,) + self.size)    # (B, H*W, s3)
    s1, s2, s3 = ts2.size()
    # ts3 = torch.bmm(ts1, ts2)                   # the call that throws the cuBLAS error
    out = torch.empty(B, s3, device=ts1.device, dtype=ts1.dtype)
    for i, batch_v in enumerate(ts1):             # batch_v: (1, H*W)
        out[i] = (batch_v @ ts2[i]).squeeze(0)    # (1, s3) -> (s3,)
    return out.view((B, -1))
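
For what it is worth, because ts1 has a singleton middle dimension, the same product can also be written without bmm/matmul at all, using a broadcasted multiply and a sum (a sketch, assuming ts1 is (B, 1, K) and ts2 is (B, K, N)):

    # Equivalent to torch.bmm(ts1, ts2).view(B, -1) when ts1 is (B, 1, K)
    prod = ts1.transpose(1, 2) * ts2   # (B, K, 1) * (B, K, N) -> (B, K, N)
    out = prod.sum(dim=1)              # (B, N)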

However, when I run the same project on PyTorch >= 1.0, torch.bmm works fine, but in the end I get worse results with the newer version.