I’m trying to find a lower-level PyTorch method to run the code below, as it’s really slow right now with the two Python-based loops. It’s essentially a sort of “expansion” of A[] by a sliding window from B:
for i in range(100):
    for j in range(100):
        corr[i, j, :, :] = A[i, j] * B[i:i+25, j:j+25]
(the numbers 100 and 25 are just placeholders)
I’ve been stuck on this for a while, and every time I find a primitive that seems like it could help, it turns out to have some limitation that makes it NOT fit the problem.
Thanks for the unfold trick! I tested it and it gave a 39x speedup. I was thinking earlier about some way of expanding B so it would match the topology of the corr[] result, but couldn’t find the method.
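For anyone finding this later, here is my understanding of the unfold version (a sketch, with the same placeholder sizes as above; it assumes A is 100×100 and B is at least 124×124 so every 25×25 window fits):

```python
import torch

A = torch.randn(100, 100)
B = torch.randn(124, 124)

# Unfold twice so that windows[i, j] is the 25x25 patch B[i:i+25, j:j+25].
# unfold() returns a view, so no data is copied here.
windows = B.unfold(0, 25, 1).unfold(1, 25, 1)   # shape (100, 100, 25, 25)

# Broadcast A over the two window dimensions; this replaces both Python loops.
corr = A[:, :, None, None] * windows            # shape (100, 100, 25, 25)
```

Since the unfolds are views, only the final broadcast multiply actually materializes data.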
I guess it does require several passes over the data because of the chained unfolds, so it could perhaps be made faster if there were indeed a CUDA primitive that did it all in one pass, but I can search for that later if this becomes a bottleneck again.
BTW, in one special case A and B are the same array, and then it looks eerily similar to some kind of 2D auto-correlation (with a limited offset range), something you’d expect there to be a primitive for somewhere.
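If the downstream computation eventually sums corr over (i, j), that special case really is a limited-offset 2D autocorrelation, and conv2d computes it in one pass. A sketch under that assumption (note that PyTorch's conv2d actually performs cross-correlation, which is exactly what's needed here):

```python
import torch
import torch.nn.functional as F

A = torch.randn(100, 100, dtype=torch.float64)
K = 25  # maximum offset, same placeholder as above

# autocorr[di, dj] = sum over i, j of A[i, j] * A[i+di, j+dj],
# with i, j restricted so both terms stay in bounds.
inp = A[None, None]                                # (1, 1, 100, 100)
ker = A[:100 - K + 1, :100 - K + 1][None, None]    # (1, 1, 76, 76)
autocorr = F.conv2d(inp, ker)[0, 0]                # (25, 25)
```

Of course this only helps if you need the reduced (25, 25) map rather than the full 4-D corr tensor, since the per-element products are summed away inside the convolution.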