Hi. I think PyTorch computes the cross entropy loss incorrectly when the ignore_index option is used.
The problem is that when ignore_index is specified (say, = k), the function only ignores targets with value y = k (in fact, it computes the cross entropy at those positions but returns 0), while it still makes full use of the logit at index k when computing the softmax normalization term for the other indices. I don't think this is the intended use for most users.
For example, with variable-length sequences, people pad the sequences and use ignore_index as the pad target index so that the padded positions (both inputs and targets) are not considered. If there are n real classes, you have to prepare (n+1) classes in the logit dimension (the input of the cross entropy loss) to include the pad class, and then ignore it via the ignore_index option.
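As a minimal sketch of this padding setup (the shapes, the PAD index, and the random logits are my own illustration, not from the original post), ignore_index behaves exactly like manually dropping the padded targets, while the pad class logit still participates in every softmax:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

PAD = 4                                   # hypothetical pad class index
logits = torch.randn(2, 3, 5)             # (batch=2, seq_len=3, 4 real classes + pad)
targets = torch.tensor([[0, 2, PAD],      # padded positions carry the PAD target
                        [1, PAD, PAD]])

# ignore_index skips the padded *targets* when averaging the loss
loss = F.cross_entropy(logits.view(-1, 5), targets.view(-1), ignore_index=PAD)

# equivalent to manually selecting only the non-pad positions
keep = targets.view(-1) != PAD
manual = F.cross_entropy(logits.view(-1, 5)[keep], targets.view(-1)[keep])
print(torch.allclose(loss, manual))       # True: only the targets are ignored
```

Note that in both variants the pad class logit (index 4) is still part of every row's softmax normalizer, which is exactly the behavior this issue is about.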
Here is an illustrative example:
# Test cross entropy loss: first create data
x = torch.log(torch.tensor([[2,3,4]],dtype=torch.float)) # a vector of 3-class logits (one of them could be a padding class)
y1 = torch.tensor([0],dtype=torch.long)
y2 = torch.tensor([1],dtype=torch.long)
y3 = torch.tensor([2],dtype=torch.long)
# calculate the negative logsoftmax for each logit index for comparison
-torch.nn.functional.log_softmax(x,dim=1) # returns tensor([1.5041, 1.0986, 0.8109])
# perform logsoftmax and NLL loss at the same time (not use ignore_index yet)
print(torch.nn.functional.cross_entropy(x,y1)) # 1.5041
print(torch.nn.functional.cross_entropy(x,y2)) # 1.0986
print(torch.nn.functional.cross_entropy(x,y3)) # 0.8109
# Now let's ignore the index 0 and find cross entropy loss for index 1
print(torch.nn.functional.cross_entropy(x,y2,ignore_index=0)) # get 1.0986
# this is the same value as when index 0 is not excluded from the logits;
# it should be ignored at the level of the logits, not just at the final target index.
# Next, let's calculate the correct cross entropy loss when index 0 is actually removed completely from both x and y
x_ignore = x[0][1:].view(1,x.shape[-1]-1) # Now we have logits of 2 classes
# indices greater than the ignored index are decreased by 1
y2_ignore = torch.tensor([0],dtype=torch.long)
y3_ignore = torch.tensor([1],dtype=torch.long)
# cross entropy with ignore_index 0 for the index 1 (which now becomes index 0)
print(torch.nn.functional.cross_entropy(x_ignore,y2_ignore)) # get 0.8473
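For completeness, the same 0.8473 can be reproduced without reshaping anything: masking the ignored class's logit to -inf removes it from the softmax normalizer while keeping the original class indices. This is a workaround sketch I'm adding for illustration, not an existing PyTorch option:

```python
import torch
import torch.nn.functional as F

x = torch.log(torch.tensor([[2., 3., 4.]]))
y2 = torch.tensor([1])

x_masked = x.clone()
x_masked[:, 0] = float('-inf')            # class 0 no longer contributes to the normalizer
print(F.cross_entropy(x_masked, y2))      # tensor(0.8473), matching x_ignore above
```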
In conclusion, I am raising this issue in case the developers want to revise the ignore_index option. If the current behavior already follows the intended use (ignore only y, not x, hence allowing backprop through the ignored index of the logit via the softmax normalization term), then this is just my misunderstanding of how it should work (I expected it to ignore the ignore_index in both x and y).
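To make the backprop point concrete, here is a small check (using the same numbers as above) showing that even with ignore_index=0, the logit of class 0 still receives a non-zero gradient through the softmax normalizer:

```python
import torch
import torch.nn.functional as F

x = torch.log(torch.tensor([[2., 3., 4.]])).requires_grad_()
y2 = torch.tensor([1])

loss = F.cross_entropy(x, y2, ignore_index=0)
loss.backward()
# d(loss)/d(logit_0) = softmax(x)[0] = 2/9, despite ignore_index=0
print(x.grad[0, 0])                       # tensor(0.2222)
```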