I asked a similar question about NumPy on Stack Overflow, but since then I've discovered the power of the GPU, and I can't go back.

So I have a 3D tensor representing a list of matrices, e.g.:

```
In [112]: matrices
Out[112]:
(0 ,.,.) =
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
(1 ,.,.) =
5 0 0 0 0
0 5 0 0 0
0 0 5 0 0
0 0 0 5 0
0 0 0 0 5
[torch.cuda.FloatTensor of size 2x5x5 (GPU 0)]
```

and a 2D tensor representing a list of vectors, e.g.:

```
In [113]: vectors
Out[113]:
1 1
1 1
1 1
1 1
1 1
[torch.cuda.FloatTensor of size 5x2 (GPU 0)]
```

… and I need a batched, GPU-powered matrix-vector product of these two tensors: each matrix multiplied by its corresponding vector.

I would expect to be able to use `torch.bmm` here, but I cannot figure out how. In particular, I don't understand why this happens:

```
In [114]: torch.bmm(matrices, vectors.permute(1,0))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-114-e348783370f7> in <module>()
----> 1 torch.bmm(matrices, vectors.permute(1,0))
RuntimeError: out of range at /py/conda-bld/pytorch_1490979338030/work/torch/lib/THC/generic/THCTensor.c:23
```

… when `matrices[i] @ vectors.permute(1,0)[i]` works for any `i < len(matrices)`.
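To make the result I'm after concrete, here is a plain loop version of the computation on small CPU tensors (same data as above, just not on the GPU); the batched call should produce the equivalent of:

```python
import torch

# Same data as above, built on CPU for illustration.
matrices = torch.stack([torch.eye(5), 5 * torch.eye(5)])  # shape (2, 5, 5)
vectors = torch.ones(5, 2)                                # shape (5, 2)

# One matrix-vector product per batch entry; result has shape (2, 5).
result = torch.stack([matrices[i] @ vectors.permute(1, 0)[i]
                      for i in range(len(matrices))])
```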

Thanks for your help…