Intuition for residual connections in Transformer layers?

I was reading through the implementation of TransformerEncoderLayer where we find the following:
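
(The snippet I was looking at didn't survive, but the relevant part of the forward pass looks roughly like this; a simplified sketch, not the exact source, with layer names approximated:)

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Simplified sketch of a Transformer encoder layer, highlighting
    the two residual (skip) connections in the forward pass."""

    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src):
        # First residual: the original input is added to the
        # self-attention output before normalization.
        src2, _ = self.self_attn(src, src, src)
        src = self.norm1(src + src2)
        # Second residual: that value is added to the output of the
        # feedforward network.
        src2 = self.linear2(torch.relu(self.linear1(src)))
        src = self.norm2(src + src2)
        return src
```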

As you can see, we sum the output of self-attention with the original input as a residual connection. And even further, we sum that value with the output of the feedforward network. Why is that? Why do we always sum the input with the output of some component (self-attention and the feedforward network in this case), rather than just taking the output? Is this a form of regularization?


Looks like residual connections (i.e., y = x + f(x) is trained instead of y = f(x)).

Yes, you are right of course. I am wondering what the intuition behind that is in this case. I only know residual connections from ResNet and friends, so I am not sure how to interpret their importance in the Transformer case (which is typically used for text).

The original attention paper uses residual connections without much explanation. I’d say the motivation is the same anywhere: combating vanishing gradients and, hopefully, easier training, since each layer only has to learn a residual correction rather than the whole transformation.

I guess it’s pretty much what they do in CV:
Skip connections, or residual connections, are used to allow gradients to flow through a network directly, without passing through the non-linear activation functions. Stacking many non-linear layers tends to make gradients explode or vanish (depending on the weights), since they are multiplied repeatedly during backpropagation. Skip connections conceptually form a ‘bus’ which flows right the way through the network, and in reverse, the gradients can flow backwards along it too.
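A toy example of that intuition (my own illustration, not from any paper): for y = x + f(x), the gradient with respect to x is 1 + f'(x), so even if f's gradient is essentially zero (a saturated or dead block), the identity path still carries gradient 1.

```python
import torch

# Residual form: even with a block whose gradient is exactly zero
# (simulated here by scaling tanh by 0), the skip path keeps gradient 1.
f = lambda t: 0.0 * torch.tanh(t)  # stand-in for a "dead" sub-layer

x = torch.tensor(2.0, requires_grad=True)
y = x + f(x)        # with residual connection
y.backward()
print(x.grad)       # tensor(1.): the identity path preserves the gradient

x2 = torch.tensor(2.0, requires_grad=True)
y2 = f(x2)          # without residual connection
y2.backward()
print(x2.grad)      # tensor(0.): the gradient has vanished
```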