ResNet blocks / skip connections only in the deepest layers

Recently I've seen a lot of convolutional encoder/decoder architectures that append ResNet blocks only after downsampling through several layers. In other words, instead of using skip connections from the beginning as in traditional ResNet architectures, they are only used once the spatial extent of the input has already been shrunk by a factor of 4 or 8. Surprisingly, the result quality seems to be on par with, if not better than, the skip-connections-all-the-way approach. Is there some intuition behind this, or can someone shed some light on it for me? A minimal sketch of the pattern I mean is below.
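
To be concrete, here is roughly the layout I'm describing, written as a small PyTorch sketch. The class names, channel sizes, and the factor-of-8 downsampling are placeholders I picked for illustration, not taken from any specific paper:

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Standard residual block: two 3x3 convs plus an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))


class Encoder(nn.Module):
    """Plain strided convs first (no skips); residual blocks only once
    the spatial size has been reduced by a factor of 8."""
    def __init__(self):
        super().__init__()
        self.downsample = nn.Sequential(                   # no skip connections here
            nn.Conv2d(3, 64, 4, stride=2, padding=1),      # H/2
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),    # H/4
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),   # H/8
            nn.ReLU(inplace=True),
        )
        self.res_blocks = nn.Sequential(                   # skips only at the deepest resolution
            ResBlock(256),
            ResBlock(256),
        )

    def forward(self, x):
        return self.res_blocks(self.downsample(x))


if __name__ == "__main__":
    enc = Encoder()
    out = enc(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 256, 32, 32])
```

The question is why this works so well compared to putting residual connections around every stage, including the full-resolution ones.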