How to properly use ignore_index when training a transformer

I’m trying to train a very simple transformer that is supposed to translate sentences into another language, and I’m having trouble getting the loss to ignore the padding index when training my model.

Here is my training loop (simplified for clarity):

for source, target in data:  # source and target shape: [max_len]
    source, target = source.to(device), target.to(device)
    optimizer.zero_grad()
    logits = model(source, target)  # shape: [max_len, vocab_size]
    loss = criterion(logits, labels)  # criterion is CrossEntropyLoss
    loss.backward()
    optimizer.step()

After training, I then do
logits = model(source, target)
and
torch.argmax(logits, dim=1)
which gives me the ids of the words in the translated sentence.
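
Put together, the inference step looks roughly like this (itos is just a placeholder for whatever id-to-word mapping your vocabulary provides):

import torch

with torch.no_grad():
    logits = model(source, target)               # [max_len, vocab_size]
    predicted_ids = torch.argmax(logits, dim=1)  # [max_len]

# itos is a hypothetical id-to-word mapping (e.g. a list or dict from the vocab)
translation = [itos[idx] for idx in predicted_ids.tolist()]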

An example source sentence:

tensor([    2,   214,   544,  1392,   306,   708,   168,   743,   103,   145,
           839, 16289, 12140,    15,     3,     1,     1,     1,     1,     1])

Target sentence:

tensor([   2,  217,  273,   13, 5008, 3650,  192,  479,  581, 1788, 5743,   15,
            3,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1])

My labels are of shape [ max_len, vocab_size ]: every word is given a probability of 0.0 except the correct word, which is given 1.0. For example, the labels for a random 5-word sentence in a language with a vocab_size of 10 would look like this:

tensor([[     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0,     .0],
        [     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0],
        [     .0,     .0,     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0],
        [     .0,     .0,     .0,     .0,     .0,     .0,     .0,     .0,     .0,    1.0],
        [     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0]])
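
For reference, this is just a one-hot encoding of the class indices; the same tensor could be produced like this (purely illustrative):

import torch
import torch.nn.functional as F

class_indices = torch.tensor([2, 3, 5, 9, 3])               # the correct word id per position
labels = F.one_hot(class_indices, num_classes=10).float()   # [5, 10], 1.0 at the correct word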

CrossEntropyLoss then takes the output of my model and the labels of the “target” sentence as input.

The problem is this: since there is a lot of padding (the padding index is 1 in my case), the model eventually learns to translate the sentence into only padding tokens, because they are the majority and that yields a smaller loss.

I tried using the “ignore_index” argument of CrossEntropyLoss like this:
CrossEntropyLoss(ignore_index=1)
but it says:

RuntimeError: ignore_index is not supported for floating point target

How can I solve this problem?

As the error message explains, the ignore_index argument is not supported for (the newer) floating point targets.
Assuming you don’t want to use “soft” target values and can use class indices instead, you could transform the current target to class indices via:

target = torch.tensor([[     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0,     .0],
                       [     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0],
                       [     .0,     .0,     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0],
                       [     .0,     .0,     .0,     .0,     .0,     .0,     .0,     .0,     .0,    1.0],
                       [     .0,     .0,     .0,    1.0,     .0,     .0,     .0,     .0,     .0,     .0]])
target = target.argmax(dim=1)
print(target)
# tensor([2, 3, 5, 9, 3])

which would not only save memory but would also allow you to use ignore_index properly.
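
Here is a minimal sketch of the loss then skipping the padded positions (the padding index of 1 comes from your setup; the random logits are just placeholders):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=1)

max_len, vocab_size = 5, 10
logits = torch.randn(max_len, vocab_size, requires_grad=True)  # model output: [max_len, vocab_size]
target = torch.tensor([2, 3, 5, 1, 1])                         # class indices; 1 marks padding

loss = criterion(logits, target)  # padded positions are excluded from the mean
loss.backward()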

Thanks for the reply, my loss function now works properly.

I wanted to ask one more thing: does CrossEntropyLoss(logits, target) realize that when logits is of shape [ max_len, vocab_size ], the max_len dimension does not represent batches, but rather the number of words in my sentence?

In the documentation, logits is expected to be of shape (N, C); however, there N represents the batch size, which I don’t have because I work with only one sentence at a time.

No, the loss function doesn’t understand any difference between “number of samples in a batch” and “number of words in a sentence”; it will simply treat dim0 as the dimension containing the different samples.
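
For your unbatched case that behavior is actually what you want, since each of the max_len positions is then treated as one sample; a quick sketch with placeholder shapes:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=1)

max_len, vocab_size = 20, 100
logits = torch.randn(max_len, vocab_size)          # dim0 = "samples" = positions in the sentence
target = torch.randint(0, vocab_size, (max_len,))  # one class index per position

loss = criterion(logits, target)  # averaged over the non-ignored positions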

Thanks for the reply. Does that mean that there is no way for me to train my model with batches? My input would then have to be of shape [ batch_size, max_len, vocab_size ], which, from my understanding, is not an acceptable input shape for the loss function. Is there a way around this?

No, sorry, this is not what I meant.
In my previous answer I assumed you were asking if the loss function would understand the difference between a model output in the shape [batch_size, nb_classes] and one in the shape [max_len, nb_classes], which is not the case.
nn.CrossEntropyLoss accepts multi-dimensional data, in your case temporal data in the shape [batch_size, nb_classes, sequence_length] as the model output, and expects targets in the shape [batch_size, sequence_length] containing class indices in the range [0, nb_classes-1].
You might thus need to .permute your model output to create the expected shape.
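
A minimal sketch of the batched case (the shapes are placeholders; the padding index of 1 again comes from your setup):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=1)

batch_size, max_len, vocab_size = 8, 24, 1000
logits = torch.randn(batch_size, max_len, vocab_size)         # model output: [batch, seq, classes]
target = torch.randint(0, vocab_size, (batch_size, max_len))  # class indices: [batch, seq]

# permute to [batch_size, nb_classes, sequence_length] as expected by nn.CrossEntropyLoss
loss = criterion(logits.permute(0, 2, 1), target)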
