I have a question about using padding indexes with input embedding layers. For context, suppose I am training a causally masked transformer language model, where sequences are always left-aligned in a batch, with padding on the right.
In `torch.nn.functional.embedding`, why does the argument `padding_idx` exist? As best I can tell, it's unnecessary. Its functionality is to (1) map `padding_idx` to the zero vector (`torch.nn.Embedding` initializes that row to zeros) and (2) ensure that its embedding always receives a gradient of 0, so it never changes during training. But if the output is properly masked (e.g., with `ignore_index` in `torch.nn.functional.cross_entropy`), then the output of the model at that timestep is ignored, and the parameters used in the computation at that timestep, including the embedding, receive a gradient of 0 anyway. So you could just use an arbitrary index for padding, say 0, and avoid having spurious embedding parameters for the padding symbol. I wrote a unit test comparing (1) setting `padding_idx` to a unique value against (2) setting it to `None` and using 0 in padded positions, and the results are identical after optimizing for a few steps with Adam.
Is there any benefit to using `padding_idx` over the alternative I described? A downside of `padding_idx` is that it forces you to keep a spurious embedding vector for the padding symbol, one that is never used in any useful computation and is not truly a parameter of the model.
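To make the extra row concrete (hypothetical sizes):

```python
import torch.nn as nn

VOCAB, DIM = 10, 8  # hypothetical sizes

# With padding_idx, the table needs an extra row reserved for the padding symbol.
with_pad = nn.Embedding(VOCAB + 1, DIM, padding_idx=VOCAB)
print(with_pad.weight.shape)     # torch.Size([11, 8])

# Reusing an existing index (e.g., 0) at padded positions avoids the extra row.
without_pad = nn.Embedding(VOCAB, DIM)
print(without_pad.weight.shape)  # torch.Size([10, 8])
```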