I expected the matmul of a tensor of shape [3, 3] with a tensor of shape [3, 16, 1080] to yield a tensor of shape [3, 16, 1080].

However, I get a runtime error.

Is my implementation wrong, or am I missing something?

```
import numpy as np
import torch
X = torch.from_numpy(np.random.randn(3,3)).cuda().float()
Y = torch.from_numpy(np.random.randn(3,16,1080)).cuda().float()
print(X.shape, Y.shape)
print(X.dtype, Y.dtype)
X @ Y
```

```
torch.Size([3, 3]) torch.Size([3, 16, 1080])
torch.float32 torch.float32
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-c2c71b1687e7> in <module>
6 print(X.shape, Y.shape)
7 print(X.dtype, Y.dtype)
----> 8 X @ Y
RuntimeError: mat1 and mat2 shapes cannot be multiplied (3240x16 and 3x3)
```
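
For context on the error: `torch.matmul` (the `@` operator) treats the last two dimensions of each operand as the matrix dimensions and broadcasts over the rest, so `Y` is seen as a batch of 3 matrices of shape 16x1080, and a 3x3 matrix cannot be multiplied with a 16x1080 one. The 3240 in the message is presumably 3 * 1080 from an internal reshape. Below is a minimal sketch of three ways to get a [3, 16, 1080] result, assuming the intent is to contract the second dimension of `X` against the first dimension of `Y` (that intent is my assumption, not something the traceback confirms). The variable names `out_einsum`, `out_reshape`, and `out_tensordot` are only illustrative, and the tensors are kept on the CPU so the snippet runs without a GPU.

```python
import torch

# CPU tensors with the same shapes as in the question
# (append .cuda() if a GPU is available, as in the original code).
X = torch.randn(3, 3)
Y = torch.randn(3, 16, 1080)

# Option 1: einsum, contracting X's second dim (j) with Y's first dim (j).
out_einsum = torch.einsum("ij,jkl->ikl", X, Y)

# Option 2: flatten Y's trailing dims, do a plain 2-D matmul, restore the shape.
out_reshape = (X @ Y.reshape(3, -1)).reshape(3, 16, 1080)

# Option 3: tensordot with dims=1 contracts the last dim of X
# with the first dim of Y.
out_tensordot = torch.tensordot(X, Y, dims=1)

print(out_einsum.shape, out_reshape.shape, out_tensordot.shape)
```

All three compute the same contraction; `einsum` makes the index pattern explicit, which may be the easiest to read when the operands have different numbers of dimensions.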