From what I have read, residual connections help prevent exploding/vanishing gradients because skip connections can “skip past the non-linear activation functions”, which means that gradients enjoy the same benefit. I do not understand what that means.
In the backward pass, why would PyTorch skip the non-linearities? How does it know to skip those? So, in this fictional example:
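Presumably the block in question computes srcOut = src + fc2(act(fc1(src))), since the answer below refers to those names. A minimal PyTorch sketch of such a block might look like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a block computing srcOut = src + fc2(act(fc1(src)))."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # Skip connection: the input is added back onto the sub-network's output.
        return src + self.fc2(self.act(self.fc1(src)))
```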
The key is that the [loss] gradient w.r.t. srcOut is passed to both summands unchanged. Nothing is literally “skipped” in the backward pass: the non-linearities are still differentiated as usual, but the backward of an addition simply copies the incoming gradient to each of its inputs, so the skip path delivers dLoss/dSrcOut to src without any Jacobian multiplications along the way. As a result, any block or partial sum could in theory learn to produce the best srcOut. Thus, later blocks learn residuals.
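You can verify the “unchanged” part directly with autograd; a minimal sketch (src, branch, and upstream are illustrative names, with branch standing in for fc2(act(fc1(src)))):

```python
import torch

src = torch.randn(4, requires_grad=True)
branch = torch.randn(4, requires_grad=True)  # stand-in for fc2(act(fc1(src)))

srcOut = src + branch
upstream = torch.randn(4)   # pretend this is dLoss/dSrcOut
srcOut.backward(upstream)

# Addition copies the incoming gradient to both summands, with no extra factors.
assert torch.equal(src.grad, upstream)
assert torch.equal(branch.grad, upstream)
```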
Contrast this with pure function composition: srcOut = fc2(act(fc1(src))). Here you have a chain of intermediate results, and by the chain rule the initial gradient dLoss/dSrcOut gets multiplied by the Jacobian of every layer on its way back to src; if those factors are consistently smaller or larger than 1, the gradient vanishes or explodes as depth grows. The skip connection adds an identity path that bypasses this chain of multiplications.
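To see the difference at scale, here is a toy comparison, a sketch under assumed sizes (depth, dim, and the helper input_grad_norm are all made up for illustration): it measures the gradient that reaches the input of a 50-block stack with and without skip connections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 32

def input_grad_norm(use_skip: bool) -> float:
    # `depth` copies of fc2(act(fc1(x))), optionally wrapped in a skip connection.
    blocks = nn.ModuleList(
        nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        for _ in range(depth)
    )
    x = torch.randn(1, dim, requires_grad=True)
    out = x
    for block in blocks:
        out = out + block(out) if use_skip else block(out)
    out.sum().backward()
    return x.grad.norm().item()

print("plain   :", input_grad_norm(use_skip=False))  # typically shrinks toward 0 with depth
print("residual:", input_grad_norm(use_skip=True))   # identity path keeps the gradient usable
```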