I’m trying to train a very simple transformer that is supposed to translate sentences into another language, and I’m having trouble getting the loss to ignore the padding index when training my model.
Here is my training loop (simplified for clarity):
for source, target in data:  # source and target shape: [max_len]
    source, target = source.to(device), target.to(device)
    optimizer.zero_grad()
    logits = model(source, target)  # shape: [max_len, vocab_size]
    loss = criterion(logits, labels)  # criterion is CrossEntropyLoss
    loss.backward()
    optimizer.step()
After training, I then do
logits = model(source, target)
and
torch.argmax(logits, dim=1)
which gives me the ids of the words of the translated sentence.
An example source sentence:
tensor([ 2, 214, 544, 1392, 306, 708, 168, 743, 103, 145,
839, 16289, 12140, 15, 3, 1, 1, 1, 1, 1])
Target sentence:
tensor([ 2, 217, 273, 13, 5008, 3650, 192, 479, 581, 1788, 5743, 15,
3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
My labels are of shape [ max_len, vocab_size ]: every word is given a probability of 0.0 except the correct word, which is given 1.0. For example, the labels for a random 5-word sentence in a language with a vocab_size of 10 would look like this:
tensor([[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
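For reference, these one-hot rows are just another way of writing class indices; here is a sketch of how I build them for the 5-word example above (the variable names are my own, not from my real code):

```python
import torch

vocab_size = 10
# class indices of the correct words for the 5-word example above
indices = torch.tensor([2, 3, 5, 9, 3])

# one-hot float labels of shape [5, vocab_size], as described above
labels = torch.nn.functional.one_hot(indices, num_classes=vocab_size).float()

# the one-hot form collapses back to the original indices
assert torch.equal(labels.argmax(dim=1), indices)
```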
CrossEntropyLoss then takes the output of my model and the labels of the “target” sentence as input.
The problem is this: since there are lots of padding tokens (the padding index is 1 in my case), the model eventually learns to translate every sentence into nothing but padding words, because they are the majority and predicting them everywhere yields a smaller loss.
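To illustrate with toy numbers (not my real model): if most target positions are padding, a model that leans toward the padding token everywhere already beats a model with no preference at all:

```python
import torch

vocab_size = 10
pad_idx = 1
# a 20-position target where only 5 positions are real words
target = torch.tensor([2, 4, 7, 9, 3] + [pad_idx] * 15)

# a "lazy" model that slightly favors the padding token at every position
lazy_logits = torch.zeros(20, vocab_size)
lazy_logits[:, pad_idx] = 2.0

# a model with uniform logits (no preference at all)
uniform_logits = torch.zeros(20, vocab_size)

lazy_loss = torch.nn.functional.cross_entropy(lazy_logits, target)
uniform_loss = torch.nn.functional.cross_entropy(uniform_logits, target)
# lazy_loss comes out lower: always guessing padding is rewarded
```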
I tried using the ignore_index argument of CrossEntropyLoss, like this:
CrossEntropyLoss(ignore_index=1)
but it says:
RuntimeError: ignore_index is not supported for floating point target
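A minimal standalone snippet (with made-up shapes) reproduces the same error:

```python
import torch

criterion = torch.nn.CrossEntropyLoss(ignore_index=1)
logits = torch.randn(5, 10)
# one-hot (floating point) targets, matching my label format
onehot = torch.nn.functional.one_hot(torch.tensor([2, 3, 5, 9, 3]),
                                     num_classes=10).float()

try:
    criterion(logits, onehot)
except RuntimeError as err:
    print(err)  # "ignore_index is not supported for floating point target"
```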
How can I solve this problem?