Turns out I used align_corners=True in the upsampling step. This adds a location-dependent phase shift to the upsampled image relative to the decoder arm, which is apparently enough for the network to learn filters that exploit it. align_corners=False leaves it invariant to translations. Nice fig here:
So I guess it may in general be a bad idea to use align_corners=True if you want translational equivariance!
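
To make the location-dependent shift concrete, here is a minimal sketch in plain Python (no PyTorch needed). It assumes the usual 1D source-coordinate mappings for linear interpolation in the two modes and ignores border clamping; `src_coord` and the sizes 4 and 8 are hypothetical, chosen only for illustration, and are not taken from the network discussed above.

```python
def src_coord(dst, in_size, out_size, align_corners):
    """Map an output pixel index to a fractional input coordinate (1D)."""
    if align_corners:
        # Centres of the first and last pixels of input and output coincide,
        # so the step between output samples is (in_size - 1) / (out_size - 1).
        return dst * (in_size - 1) / (out_size - 1)
    # Pixels are treated as unit-width cells and the image extents are aligned,
    # so the step is exactly in_size / out_size (half-pixel convention).
    return (dst + 0.5) * in_size / out_size - 0.5


in_size, out_size = 4, 8  # illustrative 2x upsampling

# Difference between the two sampling positions for every output pixel.
offsets = [
    src_coord(i, in_size, out_size, True) - src_coord(i, in_size, out_size, False)
    for i in range(out_size)
]

for i, off in enumerate(offsets):
    print(f"output pixel {i}: offset {off:+.4f} input pixels")
```

With these sizes the offset is not constant: it runs linearly from +0.25 input pixels at the first output pixel to -0.25 at the last. That is the position-dependent shift; with align_corners=False the step is exactly 0.5, so every output pixel sits at the same sub-pixel phase pattern regardless of where it is in the image.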