Why would residual networks over non-linearities prevent exploding gradients?

We often see residual connections in today’s networks, be it in ResNets or in Transformers.

The code above can be visualised as the encoder part here.

From what I have read, residual connections help prevent exploding/vanishing gradients because skip connections can “skip past the non-linear activation functions”, which supposedly means that gradients enjoy the same benefit. I do not understand what that means.

In the backward pass, why would PyTorch skip the non-linearities? How does it know to skip those? So, in this fictional example:

src2 = self.activation(self.linear1(src))
src = src + src2

what is the advantage of using the skip connection? Doesn’t the backward pass flow through all operations?
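To answer the last question directly: yes, the backward pass does flow through all operations, but the addition contributes an extra identity term, so the gradient w.r.t. `src` is the branch gradient plus one. A quick numeric sketch of this, using hypothetical shapes and a concrete `linear1`/`activation` in place of the ones in the example:

```python
import torch

# Stand-ins for self.linear1 and self.activation (assumed shapes).
linear1 = torch.nn.Linear(3, 3)
activation = torch.nn.ReLU()

src = torch.randn(3, requires_grad=True)
src2 = activation(linear1(src))
out = src + src2          # the skip connection
out.sum().backward()

# Same computation without the skip, for comparison.
src_b = src.detach().clone().requires_grad_(True)
activation(linear1(src_b)).sum().backward()

# Nothing is "skipped": the skip path simply adds an identity
# contribution of 1 to each component of the gradient.
assert torch.allclose(src.grad, src_b.grad + 1.0)
```

So the non-linearity is still differentiated through; the identity path just guarantees a second gradient route that no Jacobian can shrink.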

I feel that I am missing an important part of the puzzle, but I can’t figure out what it is.

Consider the sum srcOut = src1 + src2.

The key is that the gradient of the loss w.r.t. srcOut is passed to both summands unchanged, because d(srcOut)/d(src1) = d(srcOut)/d(src2) = 1. As a result, either block (or the partial sum) can in principle learn to produce the best srcOut, so later blocks only need to learn a residual correction on top of what earlier blocks already produce.
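You can verify this routing behaviour numerically; the tensor names here mirror the answer, and `g` plays the role of the upstream gradient dLoss/dSrcOut:

```python
import torch

src1 = torch.randn(4, requires_grad=True)
src2 = torch.randn(4, requires_grad=True)
srcOut = src1 + src2

# Pretend this is dLoss/dSrcOut coming from the rest of the network.
g = torch.tensor([1.0, 2.0, 3.0, 4.0])
srcOut.backward(g)

# Addition routes the upstream gradient to both summands unchanged.
assert torch.equal(src1.grad, g)
assert torch.equal(src2.grad, g)
```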

Contrast this with plain function composition: srcOut = fc2(act(fc1(src))). Here the backward pass multiplies the initial gradient dLoss/dSrcOut by the local Jacobian of every intermediate step, and a long chain of such multiplications can shrink the gradient toward zero (or blow it up) as depth grows. The skip connection bypasses that chain with an identity path.
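The difference becomes dramatic at depth. The following sketch (depth and width chosen arbitrarily to make the effect visible) compares the gradient norm at the input of a deep Linear+Tanh stack with and without skip connections:

```python
import torch
import torch.nn as nn

depth, dim = 50, 32   # hypothetical depth/width

def grad_norm_at_input(use_skip: bool) -> float:
    """Gradient norm at the input of a deep stack of Linear+Tanh blocks."""
    torch.manual_seed(0)                    # same weights/input for both runs
    blocks = [nn.Linear(dim, dim) for _ in range(depth)]
    act = nn.Tanh()
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for fc in blocks:
        out = act(fc(h))
        h = h + out if use_skip else out    # residual vs plain composition
    h.sum().backward()
    return x.grad.norm().item()

g_plain = grad_norm_at_input(use_skip=False)  # long chain of Jacobian products
g_skip = grad_norm_at_input(use_skip=True)    # identity path keeps gradient alive
print(g_plain, g_skip)
```

With plain composition the input gradient is many orders of magnitude smaller than with skip connections, because every layer's Jacobian multiplies it once more; the residual version always has the unattenuated identity path.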