I have a general question and was hoping for some pointers to relevant information.
Briefly: I thought a U-net was invariant to translation, so I ran a simple simulation, placing squares at random locations but labelling a square only if it lies in the left half of the image. The squares are 10 pixels across in a 128-pixel image, and I used a single max-pool step, so the smallest feature map is 64x64.
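For concreteness, the data generation I describe can be sketched as below; the function and parameter names (`make_sample`, `img_size`, `sq`) are illustrative, not from an actual library:

```python
import numpy as np

def make_sample(img_size=128, sq=10, rng=None):
    """Generate one (image, mask) pair for the toy experiment.

    A single sq x sq square is placed at a uniformly random location in an
    img_size x img_size image; the target mask labels the square only when
    its centre falls in the left half of the image.
    """
    rng = rng or np.random.default_rng()
    img = np.zeros((img_size, img_size), dtype=np.float32)
    mask = np.zeros_like(img)
    r = rng.integers(0, img_size - sq)
    c = rng.integers(0, img_size - sq)
    img[r:r + sq, c:c + sq] = 1.0
    if c + sq // 2 < img_size // 2:  # square centre in the left half
        mask[r:r + sq, c:c + sq] = 1.0
    return img, mask
```

Training pairs are then drawn i.i.d. from this generator, so identical squares appear on both halves of the image and only their position determines the label.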
I expected the U-net NOT to fit this problem well, but it turned out it did: even with a single max-pool level it successfully ignores squares on the right half of the image.
How is this possible? The network is conv + max pool + bilinear upsampling only; all of these are local, translation-equivariant operations.
Inspecting the activation layers, the encoding arm does appear to be translation equivariant: the feature maps simply shift when the input shifts, with constant intensity. The outputs in the decoding arm, however, vary in intensity depending on the translation, and this seems to be what produces the good fit. I just don't understand how convolutions with 3x3 filters can produce such a dependence on location.
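The check I did on the encoding arm (shift the input, compare the shifted feature maps) can be sketched as follows; `box3` stands in for a single conv layer, and both names are illustrative:

```python
import numpy as np

def box3(img):
    """3x3 mean filter with zero padding ('same' output size)."""
    p = np.pad(img, 1)
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def equivariance_gap(feature_fn, img, shift=5):
    """Max absolute difference between feature_fn(shifted input) and the
    shifted feature_fn(input); zero means the map acts exactly
    translation-equivariantly for this input and shift."""
    a = feature_fn(np.roll(img, shift, axis=1))
    b = np.roll(feature_fn(img), shift, axis=1)
    return float(np.abs(a - b).max())
```

Note that `np.roll` shifts circularly, so the comparison is only meaningful while the content stays away from the image border; border behaviour under zero padding is exactly where the two can start to disagree.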
What am I missing?