Hi, I’m looking for a way to accelerate the calculation of a sort of “correlation” between two tensors, over local regions.

Think of it as a sort of alternative to the Conv2DLocal operation (a convolutional layer with non-shared weights and a small kernel).

So if I have a 3x3 tensor A, a 3x3 tensor B, and a 3x3 kernel size, I want an operation that gives me an output O of shape 3x3 x 3x3. That is, for each of the 3x3 elements of A, multiply it with all 3x3 elements of B and put the 9 results in the corresponding one of the 3x3 slots of O, then repeat for the next element of A.
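To make the description concrete, here is a minimal sketch in NumPy, assuming the intended result is O[i, j, k, l] = A[i, j] * B[k, l]. The function names are made up for illustration. The first function is the slow sequential version, and the second computes the same thing in one `einsum` call.

```python
import numpy as np

def pairwise_products_loop(a, b):
    """Slow reference version: out[i, j, k, l] = a[i, j] * b[k, l]."""
    out = np.empty(a.shape + b.shape, dtype=np.result_type(a, b))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            # One slot of the output holds a[i, j] times every element of b.
            out[i, j] = a[i, j] * b
    return out

def pairwise_products(a, b):
    """Vectorized version of the same operation (an outer product)."""
    return np.einsum("ij,kl->ijkl", a, b)

a = np.arange(9.0).reshape(3, 3)
b = np.arange(1.0, 10.0).reshape(3, 3)
out = pairwise_products(a, b)
print(out.shape)  # (3, 3, 3, 3)
```

If the tensors live in PyTorch, `torch.einsum` accepts the same subscript string, so the vectorized version should carry over with little change.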

I’m sorry, this is much more difficult to describe in words than to write as (slow) sequential code.

It’s for testing correlation-based weight updates between layers A and B (in this example).
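For layers larger than the kernel, one possible reading of "local regions" is that each element of A is multiplied only with the k x k neighbourhood of B centred on the same position. The sketch below follows that reading. It assumes A and B have the same shape, k is odd, and B is zero-padded at the borders. The name `local_products` is hypothetical, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_products(a, b, k=3):
    """out[i, j, u, v] = a[i, j] * b[i + u - k//2, j + v - k//2], zero outside b."""
    assert a.shape == b.shape, "this sketch assumes same-sized layers"
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    pad = k // 2
    b_padded = np.pad(b, pad)  # zero padding on all sides
    # windows[i, j] is the k x k patch of b centred on position (i, j).
    windows = sliding_window_view(b_padded, (k, k))
    # Broadcast each a[i, j] over its own patch; no Python loops needed.
    return a[:, :, None, None] * windows

a = np.arange(25.0).reshape(5, 5)
b = np.arange(25.0, 50.0).reshape(5, 5)
out = local_products(a, b, k=3)
print(out.shape)  # (5, 5, 3, 3)
```

The output has one k x k block of products per position, which matches the shape of a non-shared local kernel, so it could be scaled and added to such weights directly.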