cuBLAS runtime error for mm and bmm on tensors with custom strides

When I use mm or bmm on a CUDA tensor with custom strides (obtained either through unfold or as_strided), a cuBLAS runtime error occurs. The same operations run correctly on CPU.

Reproducing examples:

bmm example:

import torch

N = 10
x = torch.arange(N * (N - 1), device='cuda:5').float().reshape(N, N - 1) # shape (10, 9)

unf1 = x.unfold(1, N // 2, 1) # shape (10, 5, 5)
out1 = unf1.bmm(unf1) # error below

unf2 = x.as_strided((10, 5, 5), (9, 1, 1)) # shape (10, 5, 5)
out2 = unf2.bmm(unf2) # error below

In both cases I get

On entry to SGEMM parameter number 8 had an illegal value
RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at /pytorch/aten/src/THC/THCBlas.cu:441

mm example:

import torch

N = 10
x = torch.arange(N - 1, device='cuda:5').float() # shape (9,)
unf1 = x.unfold(0, N // 2, 1) # shape (5, 5)
out1 = unf1.mm(unf1) # error below

unf2 = x.as_strided((5, 5), (1, 1)) # shape (5, 5)
out2 = unf2.mm(unf2) # error below

In both cases I get

On entry to SGEMM parameter number 8 had an illegal value
RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at /pytorch/aten/src/THC/THCBlas.cu:258

Version:
pytorch 1.0.0
cuda 10.1
gpu Titan Xp or GTX 1080
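
For reference, forcing a contiguous copy before the matmul should avoid the error, since the copy gets standard strides (a quick sketch along the lines of the bmm repro above; it gives up the memory savings of unfold):

import torch

N = 10
x = torch.arange(N * (N - 1), device='cuda:5').float().reshape(N, N - 1)

unf1 = x.unfold(1, N // 2, 1)   # view: shape (10, 5, 5), strides (9, 1, 1)
unf1c = unf1.contiguous()       # copy: standard strides (25, 5, 1)
out1 = unf1c.bmm(unf1c)         # no cublas error expected here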

Hi,

I guess the fact that both dimensions over which the mm is done have a stride of 1 trips up the logic used to decide whether or not to transpose here.
@ngimel can you confirm this?

No transpose setting would help: the sizes of unf1 are (10, 5, 5) and the strides are (9, 1, 1), and for lda/ldb the BLAS standard specifies the following (I’m quoting for lda, but ldb is similar, modulo permutation of m, n, k and the transpose value):

 LDA    - INTEGER.
             On entry, LDA specifies the first dimension of A as
             declared in the calling (sub) program. When  TRANSA =
             'N' or 'n' then LDA must be at least  max( 1, m ),
             otherwise  LDA must be at least  max( 1, k ).
             Unchanged on exit.

so at least one of the last two strides should be 5 for PyTorch to be able to correct the situation by setting the appropriate transposes.
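
To make it concrete, here is roughly the check involved (just a sketch, not the actual THCBlas dispatch code):

import torch

x = torch.arange(90).float().reshape(10, 9)
unf = x.unfold(1, 5, 1)
print(unf.shape, unf.stride())   # torch.Size([10, 5, 5]) (9, 1, 1)

# For each (5, 5) matrix, blas needs one of the two strides to be 1 and
# the other (which becomes lda, "parameter number 8" of SGEMM) to be at
# least 5. Here both trailing strides are 1, so no transpose setting
# gives a legal lda, and cublas rejects the call. A contiguous copy has
# strides (25, 5, 1) and satisfies the constraint.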
Using self-overlapping tensors (like this one) on CUDA is not the greatest idea anyway. PyTorch tries to detect this in some places and throw an error or call .contiguous(), but I don’t think that was ever made robust. Perhaps bmm should be changed to detect this situation and call .contiguous() on the tensor?
@albanD do you think it makes sense to open a github issue to discuss this?


Yes, I think that at the very least a clear error should be raised if we don’t want to handle this case, telling the user that a call to .contiguous() is necessary.
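
Something like this user-level wrapper sketches what I mean (not the actual ATen/THC code; the stride check is my approximation of what cublas requires):

import torch

def bmm_safe(a, b):
    # User-level sketch only: if the last two strides of an operand don't
    # allow a legal lda/ldb (one of them equal to 1 and the other at least
    # the size of the stride-1 dimension), fall back to a contiguous copy
    # instead of letting cublas fail.
    def needs_copy(t):
        s = t.stride()
        ok = (s[-1] == 1 and s[-2] >= max(1, t.size(-1))) or \
             (s[-2] == 1 and s[-1] >= max(1, t.size(-2)))
        return not ok
    if a.is_cuda and needs_copy(a):
        a = a.contiguous()
    if b.is_cuda and needs_copy(b):
        b = b.contiguous()
    return torch.bmm(a, b)

# e.g. bmm_safe(unf1, unf1) instead of unf1.bmm(unf1)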

Just wondering, is there any way to leverage the memory efficiency of unfold if I want to do a matrix multiplication afterwards? Even though the arbitrary strides resulting from unfold can be very memory efficient, operations like mm and bmm cannot take them as input. Moreover, einsum and tensordot call mm and bmm under the hood (and I think they may also call contiguous, which bloats memory). And even if all of that worked, the gradient matrix in the backward pass would be huge anyway…
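
To illustrate the memory concern (a rough sketch; the numbers are just element counts, not measured allocations):

import torch

T = 10000
x = torch.randn(T)           # 10k elements stored
unf = x.unfold(0, 512, 1)    # view: shape (9489, 512), still shares x's 10k-element storage
dense = unf.contiguous()     # copy: 9489 * 512 ≈ 4.9M elements stored

print(x.numel(), unf.shape, dense.numel())
# mm/bmm (and einsum/tensordot built on top of them) effectively need the
# dense copy, so the unfold view saves nothing in this case.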