Based on your description, this would be similar to a convolution (or a correlation, if you flip one kernel), which should be possible to apply using F.conv2d with two inputs.
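Something like this minimal sketch is what I have in mind (the shapes `h1, w1, h2, w2` and the tensors `f1, f2` are just assumptions for illustration, treating the second feature map as the conv kernel):

```python
import torch
import torch.nn.functional as F

h1, w1 = 8, 8   # spatial size of the first feature map (assumed)
h2, w2 = 3, 3   # spatial size of the second feature map (assumed)

f1 = torch.randn(1, 1, h1, w1)  # input: [batch_size, channels, height, width]
f2 = torch.randn(1, 1, h2, w2)  # used as the kernel: [out_channels, in_channels, kH, kW]

# "valid" correlation: output shape [1, 1, h1 - h2 + 1, w1 - w2 + 1]
out = F.conv2d(f1, f2)
print(out.shape)
```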
However, I’m a bit confused by the shapes.
If we scale down the problem a bit and assume that both feature maps have a single channel, the correlation output would have the shape [h1 + h2 - 1, w1 + w2 - 1] if you perform a full correlation, wouldn't it? These output values seem to correspond to the N correlation values you mentioned. In the last step you mention that you would repeat this step over all pixels, which is unclear to me.
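For reference, this is the "full" correlation I'm assuming (again with made-up shapes), where padding the input by (h2 - 1, w2 - 1) gives the output shape mentioned above:

```python
import torch
import torch.nn.functional as F

h1, w1 = 8, 8
h2, w2 = 3, 3

f1 = torch.randn(1, 1, h1, w1)
f2 = torch.randn(1, 1, h2, w2)

# "full" correlation via padding: output shape [1, 1, h1 + h2 - 1, w1 + w2 - 1]
out = F.conv2d(f1, f2, padding=(h2 - 1, w2 - 1))
print(out.shape)  # torch.Size([1, 1, 10, 10])
```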
Could you post some pseudo code or a dummy example in PyTorch or NumPy?