Batch element-wise dot-product of matrices and vectors

I asked a similar question about NumPy on Stack Overflow, but since discovering the power of the GPU, I can't go back there.

So I have a 3D tensor representing a list of matrices, e.g.:

In [112]: matrices
Out[112]: 

(0 ,.,.) = 
  1  0  0  0  0
  0  1  0  0  0
  0  0  1  0  0
  0  0  0  1  0
  0  0  0  0  1

(1 ,.,.) = 
  5  0  0  0  0
  0  5  0  0  0
  0  0  5  0  0
  0  0  0  5  0
  0  0  0  0  5
[torch.cuda.FloatTensor of size 2x5x5 (GPU 0)]

and a 2D tensor representing a list of vectors, e.g.:

In [113]: vectors
Out[113]: 

 1  1
 1  1
 1  1
 1  1
 1  1
[torch.cuda.FloatTensor of size 5x2 (GPU 0)]

… and I need an element-wise, GPU-powered dot product of these two tensors: the i-th matrix times the i-th vector.
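
For reference, a plain loop computes what I'm after. A minimal sketch (the setup lines are my reconstruction of the tensors printed above):

import torch

# Reconstruction of the example tensors shown above
matrices = torch.stack([torch.eye(5), 5 * torch.eye(5)]).cuda()  # size 2x5x5
vectors = torch.ones(5, 2).cuda()                                # size 5x2

# Loop-based reference: column i of the result is matrices[i] @ vectors[:, i]
result = torch.stack([matrices[i] @ vectors[:, i]
                      for i in range(len(matrices))], dim=1)     # size 5x2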

I would expect to be able to use torch.bmm here, but I can't figure out how; in particular, I don't understand why this happens:

In [114]: torch.bmm(matrices, vectors.permute(1,0))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-114-e348783370f7> in <module>()
----> 1 torch.bmm(matrices, vectors.permute(1,0))

RuntimeError: out of range at /py/conda-bld/pytorch_1490979338030/work/torch/lib/THC/generic/THCTensor.c:23

… when matrices[i] @ vectors.permute(1,0)[i] works for any i < len(matrices).

Thanks for your help…

Oh, I've just found something that works: torch.bmm(matrices, vectors.permute(1,0).unsqueeze(2)).squeeze().permute(1,0).
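
Breaking it down by shapes, for anyone who finds this later:

v = vectors.permute(1, 0)     # 5x2 -> 2x5    (batch dimension first)
v = v.unsqueeze(2)            # 2x5 -> 2x5x1  (a batch of column vectors)
out = torch.bmm(matrices, v)  # (2x5x5) bmm (2x5x1) -> 2x5x1
out = out.squeeze()           # 2x5x1 -> 2x5
out = out.permute(1, 0)       # 2x5 -> 5x2    (back to my original layout)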

So I have another question: is there any way to avoid these permutes and [un]squeeze? Should I organize my arrays differently?

torch.bmm expects both arguments to be 3D (a batch of matrices), which is why passing the 2D vectors.permute(1,0) raised that error. There's no way to avoid the permute calls, although you can use .t() (transpose) instead of permute(1, 0). Typically we put the batch dimension left-most.
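
For example, an equivalent sketch of your one-liner:

# Same computation; .t() replaces permute(1, 0), and squeeze(2) is explicit
result = torch.bmm(matrices, vectors.t().unsqueeze(2)).squeeze(2).t()  # size 5x2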

Greg is working on NumPy-style broadcasting, which will make the unsqueeze calls unnecessary.
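
Once that lands (or in later releases that ship torch.einsum), the whole thing should collapse to one call with no permute or unsqueeze; a sketch, assuming an einsum with the modern signature:

# result[n, b] = sum_k matrices[b, n, k] * vectors[k, b]  -> size 5x2
result = torch.einsum('bnk,kb->nb', matrices, vectors)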