Hey all, we have two tensors. The first is of size `(S1, N)` and the other is `(N, S2)`. Both are considerably large matrices, with `S1 = ~200k`, `N = 10000`, and `S2 = ~200k`.

The requirement is to take the dot product of a few rows (a couple of tens) of the first tensor with the complete second tensor. The few rows are given by a mask, `mask`, and there are many such masks for which we need to calculate the dot product.

Right now we're doing:

```
for mask in masks:
    t = S1[mask]
    result = torch.matmul(t, S2)
```

But it’s very slow.
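For reference, here is a self-contained version of that loop at made-up, scaled-down sizes (we use `A`/`B` for the tensors here, since above `S1`/`S2` name both the sizes and the matrices):

```python
import torch

S1_rows, N, S2_cols = 1000, 64, 500  # scaled-down stand-ins for ~200k / 10000 / ~200k
A = torch.randn(S1_rows, N)          # the (S1, N) tensor
B = torch.randn(N, S2_cols)          # the (N, S2) tensor
# A few tens of selected rows per mask, many such masks.
masks = [torch.rand(S1_rows) < 0.02 for _ in range(10)]

results = []
for mask in masks:
    t = A[mask]                # boolean indexing materializes the selected rows
    results.append(t @ B)      # shape: (num_selected_rows, S2)
```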

We found that indexing with a tensor actually copies the selected data, as opposed to indexing with an integer or a slice, which returns a view. Our guess is that `S1[mask]` creates another tensor, and that copy seems to be the bottleneck since `S1` is a pretty large matrix.
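A small demo of the view-vs-copy behavior we're describing, checked via `Tensor.data_ptr()`:

```python
import torch

x = torch.arange(12.).reshape(3, 4)

# Slice indexing returns a view: same underlying storage, no copy.
v = x[0:2]
same_storage = v.data_ptr() == x.data_ptr()      # True: shares x's memory

# Boolean (advanced) indexing materializes a new tensor: a copy.
mask = torch.tensor([True, False, True])
c = x[mask]
copied = c.data_ptr() != x.data_ptr()            # True: separate memory
```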

What are our options?

Sparse matrices? Writing custom CUDA code? Any other way?
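For what it's worth, one cheap variant we could compare against first (a sketch at made-up sizes, with `A`/`B` standing in for the two tensors): precompute integer row indices from each boolean mask once, outside the loop, then gather with `torch.index_select`. Each iteration then only touches the few selected rows instead of re-evaluating a full-length boolean mask over `S1`; whether that helps in practice is something we'd have to benchmark.

```python
import torch

S1_rows, N, S2_cols = 1000, 64, 500        # stand-in sizes
A = torch.randn(S1_rows, N)
B = torch.randn(N, S2_cols)
masks = [torch.rand(S1_rows) < 0.02 for _ in range(5)]

# Convert mask -> integer indices once, outside the hot loop.
indices = [m.nonzero(as_tuple=True)[0] for m in masks]

results = []
for idx in indices:
    rows = torch.index_select(A, 0, idx)   # still a copy, but only of the few rows
    results.append(rows @ B)
```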

Thanks!