Please help me reduce this slow loop into primitives

I’m trying to find a more low level pyTorch method to run the code below, as it’s really slow now with the two Python-based loops. It’s essentially a sort of “expansion” of A[] by a sliding window from B…

for i in range(100):
        for j in range(100):
            corr[i, j, :, : ] = A[i, j] * B[ i:i+25, j:j+25 ]

(the numbers 100 or 25 are just placeholders)

I’ve been stuck on this for a while and every time I find some primitive that seems it could help, it has some limitation that makes it NOT fit the problem :slight_smile:

Please help me Obi-Wan, you’re my only hope…

You could use unfold to achieve this:

A = torch.ones(10, 10) * 2
B = torch.arange(14*14).view(14, 14)
corr = torch.zeros(10, 10, 5, 5)

# Slow method
for i in range(10):
    for j in range(10):
        corr[i, j, :, :] = A[i, j] * B[i:i+5, j:j+5]

# Hopefully faster
corr2 = torch.zeros(10, 10, 5, 5)
D = B.unfold(0, 5, 1).unfold(1, 5, 1)
corr2 = D * A.unsqueeze(2).unsqueeze(3)

(corr == corr2).all()
print(corr[0, 0])
print(corr2[0, 0])

Thanks for the unfold trick, I tested this and it gave a speedup of 39x :slight_smile: I was thinking earlier about some way of expanding B so it would match the topology of the corr[] result, but couldn’t find the method.

I guess it does require a bunch of passes over the data, because of the multiple chained unfolds, so perhaps it could be made faster if there was indeed a CUDA primitive that did it all in one pass, but I’ll be able to search for that later when this become a bottleneck again :slight_smile:

BTW, in a special case, A and B are the same array, and in that case it looks eerily similar to some kind of 2D auto-correlation (of a limited offset size), something you’d expect there to be a primitive for somewhere maybe.

for i in range(1024):
    for j in range(1024):
        val= matrix[i,j]

This is taking alot of time.