Hey all, we have two tensors. The first is of size `(S1, N)` and the other is `(N, S2)`. Both are considerably large matrices, with `S1 = ~200k`, `N = 10000`, and `S2 = ~200k`.

The requirement is to take the dot product of a few rows (a couple of tens) of the first tensor with the complete second tensor. The few rows are given by a mask, `mask`, and there are many such masks for which we need to calculate the dot product.

Right now we're doing:

```
for mask in masks:
    t = S1[mask]
    result = torch.matmul(t, S2)
```

But it’s very slow.
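For reference, here is a self-contained version of that loop at made-up, scaled-down sizes (we use `A`/`B` for the tensors here, since above `S1`/`S2` name both the sizes and the matrices):

```python
import torch

S1_rows, N, S2_cols = 1000, 64, 500  # scaled-down stand-ins for ~200k / 10000 / ~200k
A = torch.randn(S1_rows, N)          # the (S1, N) tensor
B = torch.randn(N, S2_cols)          # the (N, S2) tensor
# A few tens of selected rows per mask, many such masks.
masks = [torch.rand(S1_rows) < 0.02 for _ in range(10)]

results = []
for mask in masks:
    t = A[mask]                # boolean indexing materializes the selected rows
    results.append(t @ B)      # shape: (num_selected_rows, S2)
```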

We found that indexing with a tensor actually copies the selected data, as opposed to indexing with an integer or a slice, which returns a view. Our guess is that `S1[mask]` creates another tensor, and that copy seems to be the bottleneck since `S1` is a pretty large matrix.
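A small demo of the view-vs-copy behavior we're describing, checked via `Tensor.data_ptr()`:

```python
import torch

x = torch.arange(12.).reshape(3, 4)

# Slice indexing returns a view: same underlying storage, no copy.
v = x[0:2]
same_storage = v.data_ptr() == x.data_ptr()      # True: shares x's memory

# Boolean (advanced) indexing materializes a new tensor: a copy.
mask = torch.tensor([True, False, True])
c = x[mask]
copied = c.data_ptr() != x.data_ptr()            # True: separate memory
```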

What are our options?

Sparse matrices? Writing custom CUDA code? Any other way?
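For what it's worth, one cheap variant we could compare against first (a sketch at made-up sizes, with `A`/`B` standing in for the two tensors): precompute integer row indices from each boolean mask once, outside the loop, then gather with `torch.index_select`. Each iteration then only touches the few selected rows instead of re-evaluating a full-length boolean mask over `S1`; whether that helps in practice is something we'd have to benchmark.

```python
import torch

S1_rows, N, S2_cols = 1000, 64, 500        # stand-in sizes
A = torch.randn(S1_rows, N)
B = torch.randn(N, S2_cols)
masks = [torch.rand(S1_rows) < 0.02 for _ in range(5)]

# Convert mask -> integer indices once, outside the hot loop.
indices = [m.nonzero(as_tuple=True)[0] for m in masks]

results = []
for idx in indices:
    rows = torch.index_select(A, 0, idx)   # still a copy, but only of the few rows
    results.append(rows @ B)
```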

Thanks!