Activation functions in nn.TransformerEncoderLayer

Hi all,

I am currently debugging a transformer’s encoder that does not learn as expected. I have registered a number of forward hooks to check the outputs of different steps in the forward pass. I would like to check whether I have many dead ReLUs, but cannot figure out where the activation is located. A vanilla nn.TransformerEncoderLayer has the following structure:

(self_attn): MultiheadAttention(
  (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
)
(linear1): Linear(in_features=512, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=512, bias=True)
(norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)

Should there not be an activation layer between linear1 and linear2 in the feed-forward block? That seems to be the case according to Attention Is All You Need, page 5, equation (2). I have looked at the source code, but am unable to figure out where the activation is applied in the PyTorch implementation.
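For reference, equation (2) of the paper defines the position-wise feed-forward network as below. Mapping W_1, b_1 to linear1 and W_2, b_2 to linear2 is my assumption about how the printout lines up with the paper.

```latex
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```

The max(0, ·) term is the ReLU I would expect to see between the two linear layers.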

Does anyone know where to place the forward hook to check for dead ReLUs in this implementation?
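For concreteness, here is a minimal sketch of the kind of check I have in mind. It assumes the activation is applied as a plain function call on the output of linear1 (which would explain why no activation module shows up in the printout), so a hidden unit counts as "dead" if the output of linear1 is never positive for it. The random input and the name `track_linear1_output` are placeholders for illustration, not part of my actual setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as in the printed structure; the default activation is ReLU.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1)
layer.train()  # check the training-time forward pass

# One flag per hidden unit: True once the unit has seen a positive pre-activation.
ever_active = torch.zeros(layer.linear1.out_features, dtype=torch.bool)

def track_linear1_output(module, inputs, output):
    # output has shape (seq_len, batch, dim_feedforward).
    # Assumption: ReLU is applied directly to this tensor, so relu(output) > 0 iff output > 0.
    positive = (output.detach() > 0).reshape(-1, output.shape[-1]).any(dim=0)
    ever_active.logical_or_(positive)

handle = layer.linear1.register_forward_hook(track_linear1_output)

# Placeholder random batches shaped (seq_len, batch, d_model); real data would go here.
for _ in range(4):
    layer(torch.randn(10, 32, 512))

handle.remove()

dead_fraction = 1.0 - ever_active.float().mean().item()
print(f"fraction of hidden units that never fired: {dead_fraction:.4f}")
```

If the assumption holds, a forward pre-hook on the dropout module would be another place to look, since its input would be the post-activation tensor. As far as I understand, recent PyTorch versions can take a fused inference path in eval mode in which module hooks may not fire, so I would run this check in training mode.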