Intuition for residual connections in Transformer layers?

I was reading through the implementation of TransformerEncoderLayer where we find the following:
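
(The snippet I was looking at didn't survive, but the relevant part of the forward pass looks roughly like this; a simplified sketch, not the exact source, with layer names approximated:)

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Simplified sketch of a Transformer encoder layer, highlighting
    the two residual (skip) connections in the forward pass."""

    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src):
        # First residual: the original input is added to the
        # self-attention output before normalization.
        src2, _ = self.self_attn(src, src, src)
        src = self.norm1(src + src2)
        # Second residual: that value is added to the output of the
        # feedforward network.
        src2 = self.linear2(torch.relu(self.linear1(src)))
        src = self.norm2(src + src2)
        return src
```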

As you can see, we sum the output of self-attention with the original input as a residual connection. And even further, we sum that value with the output of the feedforward network. Why is that? Why do we always sum the input with the output of some component (self-attention and the feedforward network in this case), rather than just taking the output? Is this a form of regularization?


Looks like residual connections (i.e., y = x + f(x) is trained instead of y = f(x)).

Yes, you are right of course. I am wondering what the intuition behind that is in this case. I only know residual connections from ResNet and friends, so I am not sure how to interpret their importance in the Transformer case (which is typically used for text).

The original attention paper uses residual connections without much explanation. I’d say the motivation is the same anywhere: combating vanishing gradients and, hopefully, easier training, since each layer only has to learn a residual correction rather than the whole transformation.

I guess it’s pretty much what they do in CV:
Skip connections, or residual connections, are used to allow gradients to flow through a network directly, without passing through the non-linear activation functions. Stacking many non-linear layers tends to make gradients explode or vanish (depending on the weights), since they are multiplied repeatedly during backpropagation. Skip connections conceptually form a ‘bus’ which flows right the way through the network, and in reverse, the gradients can flow backwards along it too.
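A toy example of that intuition (my own illustration, not from any paper): for y = x + f(x), the gradient with respect to x is 1 + f'(x), so even if f's gradient is essentially zero (a saturated or dead block), the identity path still carries gradient 1.

```python
import torch

# Residual form: even with a block whose gradient is exactly zero
# (simulated here by scaling tanh by 0), the skip path keeps gradient 1.
f = lambda t: 0.0 * torch.tanh(t)  # stand-in for a "dead" sub-layer

x = torch.tensor(2.0, requires_grad=True)
y = x + f(x)        # with residual connection
y.backward()
print(x.grad)       # tensor(1.): the identity path preserves the gradient

x2 = torch.tensor(2.0, requires_grad=True)
y2 = f(x2)          # without residual connection
y2.backward()
print(x2.grad)      # tensor(0.): the gradient has vanished
```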