From what I have read, residual connections help prevent exploding/vanishing gradients because skip connections can “skip past the non-linear activation functions”, which means that gradients enjoy the same benefit. I do not understand what that means.
In the backward pass, why would PyTorch skip the non-linearities? How does it know to skip those? So, in this fictional example:
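Presumably the block in question computes srcOut = src + fc2(act(fc1(src))), since the answer below refers to those names. A minimal PyTorch sketch of such a block might look like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a block computing srcOut = src + fc2(act(fc1(src)))."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # Skip connection: the input is added back onto the sub-network's output.
        return src + self.fc2(self.act(self.fc1(src)))
```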
The key is that the [loss] gradient w.r.t. srcOut is passed to both summands unchanged. Nothing is literally “skipped” in the backward pass: the non-linearities are still differentiated as usual, but the backward of an addition simply copies the incoming gradient to each of its inputs, so the skip path delivers dLoss/dSrcOut to src without any Jacobian multiplications along the way. As a result, any block or partial sum could in theory learn to produce the best srcOut. Thus, later blocks learn residuals.
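You can verify the “unchanged” part directly with autograd; a minimal sketch (src, branch, and upstream are illustrative names, with branch standing in for fc2(act(fc1(src)))):

```python
import torch

src = torch.randn(4, requires_grad=True)
branch = torch.randn(4, requires_grad=True)  # stand-in for fc2(act(fc1(src)))

srcOut = src + branch
upstream = torch.randn(4)   # pretend this is dLoss/dSrcOut
srcOut.backward(upstream)

# Addition copies the incoming gradient to both summands, with no extra factors.
assert torch.equal(src.grad, upstream)
assert torch.equal(branch.grad, upstream)
```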
Contrast this with pure function composition: srcOut = fc2(act(fc1(src))). Here you have a chain of intermediate results, and by the chain rule the initial gradient dLoss/dSrcOut gets multiplied by the Jacobian of every layer on its way back to src; if those factors are consistently smaller or larger than 1, the gradient vanishes or explodes as depth grows. The skip connection adds an identity path that bypasses this chain of multiplications.
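To see the difference at scale, here is a toy comparison, a sketch under assumed sizes (depth, dim, and the helper input_grad_norm are all made up for illustration): it measures the gradient that reaches the input of a 50-block stack with and without skip connections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 32

def input_grad_norm(use_skip: bool) -> float:
    # `depth` copies of fc2(act(fc1(x))), optionally wrapped in a skip connection.
    blocks = nn.ModuleList(
        nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        for _ in range(depth)
    )
    x = torch.randn(1, dim, requires_grad=True)
    out = x
    for block in blocks:
        out = out + block(out) if use_skip else block(out)
    out.sum().backward()
    return x.grad.norm().item()

print("plain   :", input_grad_norm(use_skip=False))  # typically shrinks toward 0 with depth
print("residual:", input_grad_norm(use_skip=True))   # identity path keeps the gradient usable
```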