How to make masked Embedding gradients more stable?

I am experimenting with the code that accompanied the paper “Regularizing and Optimizing LSTM Language Models” (Merity et al., 2018). In it, they apply dropout to the embedding matrix. I was wondering if it was possible to make the implementation a bit more conservative on memory (it creates an embedding-sized tensor in the process). I managed to do that, but when I put my implementation in place of theirs, I didn’t get the same numbers as theirs from the second iteration on.

(A simplified version of) their implementation:

def embedded_dropout(embed, words, dropout=0.1):
    mask =, 1)).bernoulli_(1 - dropout).expand_as(embed.weight) / (1 - dropout)
    masked_embed_weight = mask * embed.weight
    return torch.nn.functional.embedding(words, masked_embed_weight)


My implementation:

def embedded_dropout(embed, words, dropout=0.1):
    emb = embed(words)
    mask =, 1)).bernoulli_(1 - dropout) / (1 - dropout)
    m = torch.gather(mask.expand(embed.weight.size(0), words.size(1)), 0, words)
    return (emb * m.unsqueeze(2).expand_as(emb)).squeeze()

The outputs of the functions are identical. However, the gradients of embed (after going through a few LSTM layers + softmax) differ by 2.0763528e-19 after the first iteration, and then it gets progressively worse, to about 0.0037 after the 200th.

My questions are:

  1. Am I using something that is not stable w.r.t. the gradient computation?
  2. Is it possible to come up with a solution that obtains the same gradients, but only requires a words-sized mask (as in my code) as opposed to an embedding-sized one (as in the original)?

That is a neat approach!
A small difference (~1e-7 is typical with float32) is expected when you compute the same quantity in a different but mathematically equivalent way.
Are you sure you have exactly the same random sequence for the dataloader and dropout? My guess would be that the embeddings have already diverged by the time you see significantly different results.
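As a tiny illustration of why mathematically equivalent computations can round differently (plain Python float64 here; float32 behaves the same way, just with larger error):

```python
# Floating-point addition is not associative, so reordering a
# mathematically equivalent computation can change the result.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by the large value
b = (1e16 - 1e16) + 1.0
print(a, b)               # 0.0 1.0

# Summation order matters too: ten 0.1s do not sum to exactly 1.0.
print(sum([0.1] * 10) == 1.0)   # False
```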

Best regards


Yes, I wouldn’t have been surprised if the two functions returned slightly different tensors. But they don’t – only their gradients differ. And you are right, the initial divergence is well within the ballpark of ordinary floating-point error. What bugs me is that behind the scenes, the two functions should be doing the same thing; all I skip is multiplying the unused vectors in the embedding – which shouldn’t receive gradients anyway.

The random sequence is definitely the same – the number of calls to the random number generator is unchanged (embed.weight.size(0)). I have changed quite a few things in the code aside from this function, but I took extra care not to disturb the random sequence – I even deliberately left a (harmless) bug in the code so that I still get the same numbers.

And you are definitely right, the embeddings diverge by the end, but it is not because the data fed to them is different (that is actually cached, so no randomness is involved there), but because of the cumulative effect of the difference in gradients.

I am wondering if there is a way to see what (symbolic) operations this code translates to. A comparison there might reveal where the difference in gradients comes from.
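The closest thing I could find is walking the autograd graph through `grad_fn` (the node names, such as `EmbeddingBackward0`, vary a bit across PyTorch versions):

```python
import torch

embed = torch.nn.Embedding(10, 4)
out = embed(torch.tensor([[1, 2]])).sum()

# Walk the chain of backward nodes from the output down to the leaf.
names = []
node = out.grad_fn
while node is not None:
    names.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(names)
```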

There isn’t any symbolic code; the functions call various C++ kernels for the embeddings – there is the CPU backward for dense matrices, and next to it a version for sparse gradients on CPU. The CUDA code is a bit more elaborate.

I’m not sure the chances of finding an obvious reason for the discrepancy are good – essentially the backward swaps the order of summation (for multiple occurrences of the same index) and multiplication of some parts by 0.
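Scaling before versus after summation is exactly the kind of reordering that can flip low-order bits; a quick sketch of the effect in plain Python floats (toy numbers, not taken from the actual backward):

```python
# Gradient of a repeated embedding row: sum-then-scale vs scale-then-sum.
scale = 1.0 / 0.7                  # a dropout-style 1/(1-p) factor
grads = [0.1, 0.2, 0.3]            # toy per-occurrence gradients

a = scale * sum(grads)             # sum first, then multiply by the mask
b = sum(scale * g for g in grads)  # multiply each term, then sum

print(abs(a - b))                  # tiny, but not necessarily zero
```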

Best regards


I see. I ran the code until convergence, and the final perplexity is almost the same as with the original function (as expected), so I don’t mind this all that much. It would have been nice to have the same numbers, though.