When I use `mm` or `bmm` on a CUDA tensor with custom strides (obtained either through `unfold` or `as_strided`), a cuBLAS runtime error occurs. The operations run correctly when `device = "cpu"`.
On entry to SGEMM parameter number 8 had an illegal value
RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at /pytorch/aten/src/THC/THCBlas.cu:441
On entry to SGEMM parameter number 8 had an illegal value
RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at /pytorch/aten/src/THC/THCBlas.cu:258
Version:
- PyTorch 1.0.0
- CUDA 10.1
- GPU: Titan Xp or GTX 1080
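For reference, a minimal sketch of the kind of layout that triggers this (the exact shapes here are an assumption, chosen to match the strides discussed in the replies, not my original code):

```python
import torch

# Unfolding a (10, 9) tensor along dim 1 with window 5, step 1 yields a
# self-overlapping view of shape (10, 5, 5) with strides (9, 1, 1).
x = torch.randn(10, 9)
unf1 = x.unfold(1, 5, 1)
w = torch.randn(10, 5, 5)

# This works on CPU; on CUDA (PyTorch 1.0.0) the same call raised the
# cublas error above. Forcing a contiguous copy works on both devices.
out_view = torch.bmm(unf1, w)
out_copy = torch.bmm(unf1.contiguous(), w)
```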
I suspect that both dimensions involved in the mm having a stride of 1 trips up the logic used to decide whether or not to transpose here. @ngimel can you confirm this?
No transpose would help: the sizes of unf1 are (10, 5, 5) and the strides are (9, 1, 1), and for lda/ldb the BLAS standard specifies (I’m quoting for lda, but ldb is similar modulo a permutation of m, n, k and the transpose value):
LDA - INTEGER.
On entry, LDA specifies the first dimension of A as
declared in the calling (sub) program. When TRANSA =
'N' or 'n' then LDA must be at least max( 1, m ),
otherwise LDA must be at least max( 1, k ).
Unchanged on exit.
So at least one of the last two strides would need to be 5 for PyTorch to be able to correct the situation by setting the appropriate transpose flags.
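Concretely (again with hypothetical shapes that match those strides), both per-matrix strides are 1, so neither candidate leading dimension satisfies the `lda >= max(1, m)` / `max(1, k)` requirement:

```python
import torch

unf1 = torch.randn(10, 9).unfold(1, 5, 1)

# Each 5x5 batch matrix has strides (1, 1), so whichever transpose flag
# is chosen, the leading dimension cuBLAS would see is 1 < max(1, 5).
print(unf1.stride())  # (9, 1, 1)

# A contiguous copy restores a layout SGEMM can accept: the inner
# matrices get strides (5, 1), i.e. lda = 5.
print(unf1.contiguous().stride())  # (25, 5, 1)
```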
Using self-overlapping tensors (like this one) on CUDA is not the greatest idea anyway. PyTorch tries to detect them in some places and either throw an error or call `.contiguous()`, but I don’t think this was ever made robust. Perhaps bmm should be changed to detect this situation and call `.contiguous()` on the tensor? @albanD do you think it makes sense to open a github issue to discuss this?
Yes, I think that if we don’t want to handle this case, at the very least a clear error should be raised telling the user that a call to `.contiguous()` is necessary.
Just wondering, is there any way to leverage the memory efficiency of unfold if I want to do a matrix multiplication afterwards? Even though the arbitrary strides produced by unfold can be very memory efficient, it seems that operations like mm and bmm cannot accept them. Moreover, einsum and tensordot call mm and bmm under the hood (I think they may also call `.contiguous()`, which bloats memory). And even if all of that worked, the gradient matrix in the backward pass would be huge anyway…
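To make the trade-off concrete, here is a small sketch (shapes are assumptions, reusing the layout from earlier in the thread) showing that the unfolded view itself costs nothing extra, while the `.contiguous()` copy needed for bmm materializes every overlapping window:

```python
import torch

x = torch.randn(10, 9)
patches = x.unfold(1, 5, 1)  # a view: no extra storage for the windows

# The view shares x's underlying storage...
assert patches.data_ptr() == x.data_ptr()

# ...but .contiguous() copies each overlapping window out, so the
# materialized tensor is larger than the original data.
dense = patches.contiguous()
print(x.numel(), dense.numel())  # 90 vs 250
```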