Why use dropout in the positional encoding layer?

import math

import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """x: Tensor, shape [seq_len, batch_size, embedding_dim]"""
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
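For anyone wondering what the registered buffer actually contains, here is a self-contained sketch (using a tiny `max_len` and `d_model` purely for illustration) that rebuilds the table the constructor builds and checks the position-0 row:

```python
import math

import torch

# Rebuild the sinusoidal table from the constructor above, on a tiny scale.
max_len, d_model = 5, 6
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, 1, d_model)
pe[:, 0, 0::2] = torch.sin(position * div_term)  # even dims get sines
pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd dims get cosines

# At position 0, sin(0) = 0 fills the even dims and cos(0) = 1 the odd ones.
print(pe[0, 0])  # tensor([0., 1., 0., 1., 0., 1.])
```

Each position therefore gets a fixed, distinct vector, which is what gets added to the token embeddings.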

Any update here? I am trying to understand the same

Dropout is a type of regularization. The final embedding for each token that the transformer uses is the sum of the standard (learned) embedding and the positional encoding, and dropout is applied to that sum, not just to the (constant) positional encoding. The sum is itself an embedding, i.e. a bunch of activations, and dropout regularizes it as usual. Are you looking here:
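In code, the order of operations described above is: embed the tokens, add the positional table, then drop out the sum. A minimal sketch (the vocabulary size, dimensions, and the random stand-in for the sinusoidal table are just illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, seq_len = 100, 16, 4
embed = nn.Embedding(vocab_size, d_model)  # learned token embeddings
pe = torch.randn(seq_len, d_model)         # stand-in for the sinusoidal table
dropout = nn.Dropout(p=0.1)

tokens = torch.tensor([1, 5, 9, 2])
x = dropout(embed(tokens) + pe)  # dropout on the sum, not on either part alone
print(x.shape)  # torch.Size([4, 16])
```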

Let me know if I’m missing something.

Hi Andrei, thanks for replying.
It's still unclear to me why the dropout layer is applied to the sum of the two (PEs + embeds) instead of applying dropout to the embeds and then adding the PEs. I interpreted it as a way to introduce a bit of uncertainty over positions, but it's not clear why. Surely, after some attempts, this turned out to work better, but I don't get why. Am I missing something?

So usually the "embedding" of a word is the embedding that's used for that token. In this case, the embedding is the parametric embedding plus the constant positional encoding. When you apply dropout to a neuron, you kill the entire neuron. So if you have a sequence of length 10 and each token is a 512-dimensional vector, with p = 0.1 you kill on average 10% of the entries in the 10-by-512 matrix that represents the data. If you only did this to the parametric embeddings and not the positional ones, you would not fully kill a neuron: you'd leave in its positional information, so it's not really dropout.
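The difference is easy to see numerically. In the sketch below (random tensors stand in for the learned embeddings and the positional table), seeding the RNG identically before each dropout call gives the same mask, so we can compare dropping after the sum against dropping only the embedding: the first zeroes a slot completely, while the second leaves the positional value behind in the "dropped" slot.

```python
import torch
import torch.nn as nn

emb = torch.randn(10, 512)  # stand-in learned embeddings for a length-10 sequence
pe = torch.randn(10, 512)   # stand-in positional encodings
drop = nn.Dropout(p=0.1)
drop.train()                # make sure dropout is active

torch.manual_seed(42)
after_sum = drop(emb + pe)   # dropout on the sum: dropped slots are exactly 0
torch.manual_seed(42)        # same seed -> same mask for a same-shaped input
before_sum = drop(emb) + pe  # dropout on embeddings only, then add positions

mask = after_sum == 0  # the slots the mask killed
# the "before sum" variant still carries the positional value in those slots
print(torch.equal(before_sum[mask], pe[mask]))  # True
```

So with dropout before the sum, every "killed" slot still leaks its position back in, which is the point made above.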

The proof in this type of stuff, however, is in the data: you can try the experiment yourself. The above is not really a theorem or anything, just my rationalization.


Ahhhh, that makes sense! Since dropout works by zeroing neurons, you have to apply it to the sum of the two! Thanks a lot!
Nevertheless, as you said, they likely tried both ways and found that this one works better (I'd love to experiment with it myself, but I don't have a GPU).