I have a question about using padding indexes with input embedding layers. For context, suppose I am training a causally masked transformer language model, where sequences are always left-aligned in a batch, with padding on the right.
In `torch.nn.functional.embedding`, why does the argument `padding_idx` exist? As best I can tell, it's unnecessary. Its functionality is to (1) map `padding_idx` to the zero vector (`torch.nn.Embedding` initializes that row to zeros) and (2) ensure that its embedding always receives a gradient of 0, so it never changes during training. But if the output is properly masked (e.g., with `ignore_index` in `torch.nn.functional.cross_entropy`), then the output of the model at that timestep is ignored, and the parameters used in the computation at that timestep, including the embedding, receive a gradient of 0 anyway. So you could just use an arbitrary index for padding, say 0, and avoid having spurious embedding parameters for the padding symbol. I wrote a unit test comparing (1) setting `padding_idx` to a unique value against (2) setting it to `None` and using 0 in padded positions, and the results are identical after optimizing for a few steps with Adam.
Is there any benefit to using `padding_idx` over the alternative I described? A downside of `padding_idx` is that it forces you to keep a spurious embedding vector for the padding symbol, one that is never used in any useful computation and is not truly a parameter of the model.
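To make the extra row concrete (hypothetical sizes):

```python
import torch.nn as nn

VOCAB, DIM = 10, 8  # hypothetical sizes

# With padding_idx, the table needs an extra row reserved for the padding symbol.
with_pad = nn.Embedding(VOCAB + 1, DIM, padding_idx=VOCAB)
print(with_pad.weight.shape)     # torch.Size([11, 8])

# Reusing an existing index (e.g., 0) at padded positions avoids the extra row.
without_pad = nn.Embedding(VOCAB, DIM)
print(without_pad.weight.shape)  # torch.Size([10, 8])
```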