Ignore_index in the cross entropy loss


Hi. I think PyTorch calculates the cross entropy loss incorrectly when the ignore_index option is used.
The problem is that when you specify ignore_index (say, = k), the function only ignores targets with value y = k (in fact, it computes the cross entropy at k but returns 0), while it still uses the logit at index k in the softmax normalization term for all the other indices. I don't think this is the intended use for most users.

For example, with variable-length sequences, people pad the sequences and use ignore_index as the pad target index to avoid the padded values affecting the loss (both from the inputs and the targets). If there are n classes, you have to prepare n + 1 classes in the logit dimension (the input of the cross entropy loss) to include the pad class, and then ignore it with the ignore_index option.
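A minimal sketch of that padding pattern (the PAD index, shapes, and random logits here are hypothetical, just to show the call):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical setup: n = 4 real classes plus one pad class at index 4,
# so the logits need n + 1 = 5 dimensions.
PAD = 4
logits = torch.randn(2, 3, 5)          # (batch, seq_len, n_classes + 1)
targets = torch.tensor([[0, 2, PAD],   # sequences padded to length 3
                        [1, PAD, PAD]])

# Padded positions are skipped in the loss (and in the averaging)
loss = F.cross_entropy(logits.view(-1, 5), targets.view(-1), ignore_index=PAD)
print(loss)
```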

Here is an illustrative example:

# Test cross entropy loss: first create the data
import torch

x = torch.log(torch.tensor([[2,3,4]],dtype=torch.float)) #  a vector of 3-class logits (one of them could be a padding class)
y1 = torch.tensor([0],dtype=torch.long)
y2 = torch.tensor([1],dtype=torch.long)
y3 = torch.tensor([2],dtype=torch.long)

# calculate the negative logsoftmax for each logit index for comparison
-torch.nn.functional.log_softmax(x,dim=1) # returns tensor([1.5041, 1.0986, 0.8109]) 

# perform logsoftmax and NLL loss at the same time (not use ignore_index yet)
print(torch.nn.functional.cross_entropy(x,y1)) # 1.5041 
print(torch.nn.functional.cross_entropy(x,y2)) # 1.0986 
print(torch.nn.functional.cross_entropy(x,y3)) # 0.8109 

# Now let's ignore the index 0 and find cross entropy loss for index 1
print(torch.nn.functional.cross_entropy(x,y2,ignore_index=0)) # get 1.0986 
# this is the same value as when index 0 is not excluded from the logits;
# it should ignore the index at the level of the logits, not just at the final target index.

# Next let's calculate the correct cross entropy loss when you actually ignore the index 0 completely from both x and y
x_ignore = x[0][1:].view(1,x.shape[-1]-1) # Now we have logits of 2 classes
# indices greater than the ignored index shift down by 1
y2_ignore = torch.tensor([0],dtype=torch.long)
y3_ignore = torch.tensor([1],dtype=torch.long)
# cross entropy with ignore_index 0 for the index 1 (which now becomes index 0)
print(torch.nn.functional.cross_entropy(x_ignore,y2_ignore)) # get 0.8473

In conclusion, I raise this issue in case the developers want to revise the ignore_index option. But if the current behavior is the intended use (ignore only y, not x, and hence allow backprop through the ignored logit index in the softmax normalization term), then it is my misunderstanding of how it should work (I expected it to ignore both x and y at ignore_index).
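The backprop point can be checked directly: even with ignore_index=0, the logit at index 0 receives a nonzero gradient, because it still sits in the softmax denominator (a small sketch reusing the numbers above):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[2., 3., 4.]]).log()
x.requires_grad_()
y = torch.tensor([1])

loss = F.cross_entropy(x, y, ignore_index=0)
loss.backward()

# The gradient w.r.t. the logits is softmax(x) - onehot(y), so the
# "ignored" index 0 still gets grad = softmax(x)[0] = 2/9, not zero.
print(x.grad)
```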

(Adrián Javaloy) #2

I think that’s not what the ignore_index parameter means. Its description says:

Specifies a target value that is ignored and does not contribute to the input gradient. When size_average is True , the loss is averaged over non-ignored targets.

So what you specify with ignore_index = k is that the elements of the target that have value k will not contribute to the error. And if you specify size_average = True, then the average won’t count those elements either.
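That averaging behavior is easy to verify (reduction='mean' is the current equivalent of size_average=True; the numbers are the same toy logits as above):

```python
import torch
import torch.nn.functional as F

x = torch.log(torch.tensor([[2., 3., 4.],
                            [2., 3., 4.]]))
y = torch.tensor([1, 0])   # the second target (value 0) is ignored

# Only the first element contributes, and the mean divides by 1, not 2,
# so the result is exactly -log_softmax(x)[0, 1] = log(3) ~ 1.0986.
loss = F.cross_entropy(x, y, ignore_index=0)
print(loss)
```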

Its intended use, or at least the way I use it, is for cases where you add padding to your input so that all instances have the same length.


@adrianjav Yes, targets with index k do not contribute to the error. But the problem is that class k at the softmax layer is not ignored when calculating the softmax for the other classes (index k still appears in the denominator of the softmax formula, since PyTorch does not drop it).

For example, suppose you have only 2 classes: class 1 and class 2 (your padding class). When you ignore the padding class, the softmax probability of class 1 must always be 1 (because there is only one class left to consider). But if you use the ignore_index option, it will not return 1 in general, since the padding class is still not eliminated from consideration (and there is even a chance that the model assigns higher probability to the padding class on unseen data).
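A quick numeric illustration of this point (indices 0 and 1 here stand for the real class and the padding class; the logit values are arbitrary):

```python
import torch
import torch.nn.functional as F

# One real class (index 0) and one padding class (index 1) with equal logits:
# the padding class still absorbs probability mass, so the real class gets
# 0.5, not the 1.0 you would expect after "ignoring" the padding class.
logits = torch.tensor([[1.0, 1.0]])
probs = F.softmax(logits, dim=1)
print(probs)  # tensor([[0.5000, 0.5000]])
```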