Why is the U-net not translationally invariant?

I have a general question that I was hoping for some pointers to information on.
Briefly, I thought a U-net was invariant to translation, so I made a simple simulation: squares placed at random locations, but labelled only if they fall in the left half of the image. The squares are 10 pixels in a 128-pixel image, and I used a single max-pool step, so the lowest feature map is 64×64.
I was expecting the U-net NOT to be able to fit this problem well, but it turns out it did. Even with a single max-pool level it successfully ignores squares on the right half of the image.
How is this possible? The network is only conv + max pool + bilinear upsampling, and all of these are local, translation-equivariant operations.
Inspecting the activations, the encoding arm appears to be translation-equivariant: feature maps simply move when the input moves, and their intensity stays constant. The outputs in the decoding arm, however, vary in intensity depending on the translation, and they seem to be responsible for the good fit. I just don’t understand how convolutions with 3x3 filters can produce that dependence on location.
What am I missing?
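For reference, the toy data can be sketched roughly like this (a minimal NumPy version of my setup; the exact labelling rule below, square entirely within the left half, is one choice of convention):

```python
import numpy as np

def make_sample(img_size=128, sq=10, rng=None):
    """One square at a random location; the mask labels it only
    if the square lies entirely in the left half of the image."""
    if rng is None:
        rng = np.random.default_rng()
    img = np.zeros((img_size, img_size), dtype=np.float32)
    mask = np.zeros_like(img)
    r = int(rng.integers(0, img_size - sq))
    c = int(rng.integers(0, img_size - sq))
    img[r:r + sq, c:c + sq] = 1.0
    if c + sq <= img_size // 2:  # square entirely in the left half
        mask[r:r + sq, c:c + sq] = 1.0
    return img, mask

img, mask = make_sample(rng=np.random.default_rng(0))
```

The target is then pixel-identical to the input for left-half squares and all-zero for right-half squares, so a perfect fit requires the network to know the absolute horizontal position of the square.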


Turns out I used align_corners=True in the upsampling step. This adds a location-dependent phase shift to the upsampled image relative to the decoder arm, which is apparently enough for the network to learn filters that exploit it. align_corners=False leaves it equivariant to translations. Nice fig here:

So I guess it may in general be a bad idea to use align_corners=True if you want translational equivariance!
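The effect is easy to check directly. Below is a small sketch (my own test, not from the original experiment) of a single max-pool + bilinear-upsample stage: if the stage is translation-equivariant, shifting the input by 2 px should shift the output by 2 px. That holds with align_corners=False but not with align_corners=True:

```python
import torch
import torch.nn.functional as F

def pool_up(x, align_corners):
    # One encoder/decoder stage: 2x2 max pool, then 2x bilinear upsampling.
    y = F.max_pool2d(x, 2)
    return F.interpolate(y, scale_factor=2, mode="bilinear",
                         align_corners=align_corners)

# A 2x2 "square" in a 16x16 image, plus the same square shifted right by 2 px.
x = torch.zeros(1, 1, 16, 16)
x[..., 4:6, 4:6] = 1.0
x_shifted = torch.roll(x, shifts=2, dims=-1)

# For an equivariant stage, shifting the output of the original input by
# 2 px should exactly match the output of the shifted input.
errs = {}
for ac in (False, True):
    out = pool_up(x, ac)
    out_shifted = pool_up(x_shifted, ac)
    errs[ac] = (torch.roll(out, 2, dims=-1) - out_shifted).abs().max().item()
    print(f"align_corners={ac}: max equivariance error = {errs[ac]:.4f}")
```

The reason is the sampling grid: with align_corners=False the source coordinates use a uniform stride of in/out, so the interpolation weights repeat and the operation commutes with integer shifts (away from the borders). With align_corners=True the stride is (in-1)/(out-1), anchored to the image corners, so the interpolation weights depend on absolute position, and that is the location signal the network can learn to read.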