Why does PyTorch's Transformer implementation `torch.nn.Transformer` have an additional LayerNorm layer on the encoder's/decoder's output?

In the source code of `torch.nn.Transformer` (torch.nn.modules.transformer, PyTorch 2.0 documentation), the encoder part is constructed as follows:

encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,
                                        activation, layer_norm_eps, batch_first, norm_first,
                                        **factory_kwargs)
encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

From the code of `torch.nn.TransformerEncoder` (torch.nn.modules.transformer, PyTorch 2.0 documentation) we can see that the output is passed through this LayerNorm layer:

if self.norm is not None:
    output = self.norm(output)

return output
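
One case where this final norm clearly matters is the pre-norm configuration (`norm_first=True`), where each layer ends with a plain residual add, so the stacked output is never normalized unless the encoder-level norm is supplied. A small sketch illustrating this (the sizes and seed here are arbitrary, not from the source above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, num_layers = 16, 4, 2
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model) since batch_first=True

# Pre-norm layers: each layer ends with "x + sublayer(norm(x))", a raw
# residual add, so the stack's output is not normalized by itself.
layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True,
                                   norm_first=True)
# TransformerEncoder deep-copies `layer`, so both stacks below start
# from identical weights.
enc = nn.TransformerEncoder(layer, num_layers,
                            norm=nn.LayerNorm(d_model)).eval()
enc_bare = nn.TransformerEncoder(layer, num_layers).eval()

with torch.no_grad():
    y_norm = enc(x)      # final LayerNorm applied
    y_bare = enc_bare(x)  # raw residual stream

# y_norm has ~zero mean and unit variance per position; y_bare does not.
print(y_norm.mean(-1).abs().max(), y_bare.mean(-1).abs().max())
```

So for pre-norm stacks (as in GPT-2-style models) the extra norm is doing real work: it rescales the accumulated residual stream before it is handed to whatever comes next.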

My question is: why do we need another LayerNorm here, considering each layer already applies a LayerNorm after the self-attention block and after the feed-forward block? What's the benefit? Is it necessary?

`torch.nn.TransformerEncoderLayer` (torch.nn.modules.transformer, PyTorch 2.0 documentation) computes its output as:

x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask, is_causal=is_causal))
x = self.norm2(x + self._ff_block(x))
return x
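
In this default post-norm configuration, the last operation of the last layer is already `self.norm2`, so the encoder-level LayerNorm re-normalizes an already-normalized tensor and is close to an identity at initialization. A quick sketch to check this (sizes and seed are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, num_layers = 16, 4, 2
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model) since batch_first=True

# Default post-norm layers: each layer already ends in self.norm2.
# TransformerEncoder deep-copies `layer`, so both stacks below start
# from identical weights.
layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
enc_plain = nn.TransformerEncoder(layer, num_layers).eval()
enc_extra = nn.TransformerEncoder(layer, num_layers,
                                  norm=nn.LayerNorm(d_model)).eval()

with torch.no_grad():
    y_plain = enc_plain(x)
    y_extra = enc_extra(x)

# Re-normalizing an already-normalized tensor barely changes it.
print((y_plain - y_extra).abs().max())  # very small at initialization
```

So at least at initialization the extra norm is nearly redundant for post-norm stacks; whether it helps after training is exactly the kind of empirical question the rest of this answer is about.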

Sorry for not having an answer!

However, asking for the purpose of a layer is generally a tricky question. For example, there are whole papers trying to figure out the purpose of individual components of the Transformer (e.g., just for the feed-forward layer: "Transformer Feed-Forward Layers Are Key-Value Memories").

If you check the original Transformer paper ("Attention Is All You Need"), it's mainly a description of the architecture and its components, without much theoretical justification.