# Residual blocks with 1 layer per block?

Hey,

I am currently reading the ResNet paper, and I noticed that their residual blocks always contain two convolutions. I see that the first convolution maps the input channels to the desired channel number of the residual block (if the channel dimension changes between subsequent residual blocks), while the second convolution keeps the channel dimension fixed. When there is no change in dimension between two residual blocks, both keep the input channel dimension fixed. I was wondering: why is it actually necessary to have two convolutions in the same block? One could just have a single convolution:

```python
xconv = nn.Conv2d(in_channels, out_channels, kernel_size=(k, k))(x_in)
```

and for the convolution corresponding to the skip connection for computing the residual, one would have:

```python
xskip = nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1))(x_in)
```

Then the output of the residual block would be:

```python
x = xconv + xskip
```

Is the reason for having blocks with two (or more) layers just a design choice, or is there a specific reason to use at least two conv layers (and not one) per residual block?

Skip connections are just elementwise additions and computationally cheap, so I can't imagine that computational cost is the explanation.
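To make the idea concrete, here is a rough sketch of the single-convolution block I have in mind (assuming a 3x3 kernel with matching padding so the spatial size stays fixed, and a 1x1 projection on the skip path when the channel count changes):

```python
import torch
import torch.nn as nn

class OneConvResidualBlock(nn.Module):
    """Sketch of the proposed block: one 3x3 conv plus a 1x1 skip projection."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # single "main path" convolution; padding=1 keeps the spatial size
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # 1x1 projection so the skip matches the new channel count
        self.skip = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

x = torch.randn(1, 16, 8, 8)
block = OneConvResidualBlock(16, 32)
print(block(x).shape)  # torch.Size([1, 32, 8, 8])
```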

Thanks!
Best, JZ

I’m not sure if you are referring to `BasicBlock` and `Bottleneck` defined here, but note that:

```python
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)

out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
```

cannot be replaced with a single convolution.

Yes, referring to that.
The layers are defined like:

```python
self.conv1 = conv3x3(inplanes, planes, stride)
self.bn1 = norm_layer(planes)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(planes, planes)
self.bn2 = norm_layer(planes)
```

If it's just about getting the right shape of `out`, one could replace it with a single convolution:

```python
conv = conv3x3(inplanes, planes, stride)
```

no?
Or do you mean that the ReLU nonlinearity in between changes that?

Best, JZ

Right, if you are only concerned about getting the shape right, a single layer would do it. However, the actual processing of multiple layers with nonlinearities between them would not be the same, so you might lose the training properties of these blocks.
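As a quick illustration (using 1x1 convolutions so the algebra is easy to see): without the ReLU, two stacked convolutions collapse into a single linear map, so one conv with the right weights reproduces them exactly; with the ReLU in between, that is no longer possible. This is just a toy check, not the ResNet code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# two stacked 1x1 convs (no bias) -- each is just a per-pixel matrix multiply
conv1 = nn.Conv2d(4, 8, kernel_size=1, bias=False)
conv2 = nn.Conv2d(8, 4, kernel_size=1, bias=False)

# a single 1x1 conv whose weight is the product of the two weight matrices
merged = nn.Conv2d(4, 4, kernel_size=1, bias=False)
with torch.no_grad():
    w1 = conv1.weight.squeeze(-1).squeeze(-1)  # shape (8, 4)
    w2 = conv2.weight.squeeze(-1).squeeze(-1)  # shape (4, 8)
    merged.weight.copy_((w2 @ w1).unsqueeze(-1).unsqueeze(-1))

x = torch.randn(2, 4, 5, 5)

# without a nonlinearity, the two convs equal the merged single conv ...
linear_out = conv2(conv1(x))
print(torch.allclose(linear_out, merged(x), atol=1e-5))  # True

# ... but with a ReLU in between, they no longer do
relu_out = conv2(torch.relu(conv1(x)))
print(torch.allclose(relu_out, merged(x), atol=1e-5))  # False
```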

ok, thanks for clarifying!

Hey once more,

so I continued looking at the ResNet code. The downsampling function is:

```python
downsample = nn.Sequential(
    conv1x1(self.inplanes, planes * block.expansion, stride),
    norm_layer(planes * block.expansion),
)
```

This means that when the stride s > 1, only every s-th entry of the input x is "transferred" via the skip connection. Isn't that a loss of valuable information? Is this what actually happens? If so, wouldn't it be better to apply a pooling function, e.g. average pooling with stride s, before the conv1x1 to take care of the downsampling, and then perform the conv1x1 with stride 1?
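Concretely, the variant I have in mind would look something like this (just a sketch of the idea, not the torchvision code; the layer names are my own):

```python
import torch
import torch.nn as nn

stride = 2
inplanes, outplanes = 64, 128

# torchvision-style downsample: the strided 1x1 conv only ever looks at
# every stride-th spatial position of the input
downsample_strided = nn.Sequential(
    nn.Conv2d(inplanes, outplanes, kernel_size=1, stride=stride, bias=False),
    nn.BatchNorm2d(outplanes),
)

# proposed variant: average-pool first so every input position contributes,
# then project the channels with a stride-1 conv1x1
downsample_pooled = nn.Sequential(
    nn.AvgPool2d(kernel_size=stride, stride=stride),
    nn.Conv2d(inplanes, outplanes, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(outplanes),
)

x = torch.randn(1, inplanes, 16, 16)
print(downsample_strided(x).shape)  # torch.Size([1, 128, 8, 8])
print(downsample_pooled(x).shape)   # torch.Size([1, 128, 8, 8])
```

Both produce the same output shape, so the pooled version could drop in as a replacement for the skip-connection downsampling.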

Thanks!
Best, JZ

Sure, your idea sounds valid, and it might be worth running some experiments to try it out.
Let us know if you see any improvement!

Yes, I will do that!