Fast partial matrix multiplication

I’d like to multiply matrices (or tensors) A and B to get a matrix C, but I only need the entries in some neighborhood of the diagonal of C. For example, for a square n×n matrix C, I want only the entries C[i,j] such that i>j>max(i-k,0) for a fixed k<<n. This is like a 0-d convolution without weight sharing. For any such partial matrix multiplication, a naive approach is to expand one of the matrices along a new dimension using a custom index matrix, take the Hadamard product with the other matrix, and then sum over one dimension. For example,

c = (a.unsqueeze(2)*b).sum(-1)

For certain tensors, this took 2.4 sec on a CPU, whereas computing all entries of the matrix with torch.bmm took 0.22 sec. The first line above takes 1.4 sec, so I don’t think the difference will vanish on a GPU. What should I do? Would torch.sparse help?
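
For concreteness, here is a minimal, self-contained sketch of the naive approach for the unbatched 2-D case. The function name band_matmul_naive, the shapes (a is n×m, b is m×n), and the band layout (band[i, d-1] holds C[i, i-d] for d = 1..k-1) are assumptions made for illustration; they are not the exact tensors or code that were timed.

```python
import torch

def band_matmul_naive(a, b, k):
    """Hypothetical helper: entries C[i, j] of C = a @ b with i > j > max(i-k, 0).

    a: (n, m), b: (m, n). Returns (band, valid), both of shape (n, k-1),
    where band[i, d-1] = C[i, i-d] and valid marks the in-range entries.
    """
    n = a.shape[0]
    i = torch.arange(n).unsqueeze(1)        # (n, 1) row indices
    d = torch.arange(1, k).unsqueeze(0)     # (1, k-1) offsets below the diagonal
    j = i - d                               # (n, k-1) column indices, may be negative
    valid = j > 0                           # j > i-k already holds because d < k
    cols = b.t()[j.clamp(min=0)]            # (n, k-1, m): gathered columns of b
    band = (a.unsqueeze(1) * cols).sum(-1)  # Hadamard product, then sum over m
    return band.masked_fill(~valid, 0), valid

torch.manual_seed(0)
n, m, k = 6, 4, 3
a = torch.randn(n, m, dtype=torch.float64)
b = torch.randn(m, n, dtype=torch.float64)
band, valid = band_matmul_naive(a, b, k)
```

Note that this materializes an intermediate tensor of n*(k-1)*m elements before the reduction, which may be part of why it is slower than a full dense product.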