Hi. I think PyTorch calculates the cross-entropy loss incorrectly when using the `ignore_index` option.

The problem is that when `ignore_index` is specified (say, `ignore_index = k`), the function only ignores targets with value `y = k` (in fact, it still computes the cross entropy at those positions but returns 0 for them), yet it still makes full use of the logit at index `k` when computing the softmax normalization term for all the other indices. I don't think this is what most users intend.
For example, with variable-length sequences, people pad the sequences and use `ignore_index` as the pad target index so that the padded positions are excluded (from both inputs and targets). If there are n real classes, you have to prepare n+1 classes in the logit dimension (the input of the cross-entropy loss) to include the pad class, and then ignore it via the `ignore_index` option.
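To make the padding setup above concrete, here is a minimal sketch (the shapes, `pad_idx` name, and random logits are my own illustration, not from the original report): 3 real classes plus one extra pad class at index 3, with the pad index passed as `ignore_index`.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: batch of 2 sequences padded to length 3,
# n = 3 real classes plus one extra "pad" class at index 3.
n_classes = 3
pad_idx = 3
logits = torch.randn(2, 3, n_classes + 1)        # (batch, seq_len, n + 1)
targets = torch.tensor([[0, 2, pad_idx],         # first sequence has length 2
                        [1, pad_idx, pad_idx]])  # second sequence has length 1

# cross_entropy expects the class dimension at position 1, i.e. (N, C, d1)
loss = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)
# With the default mean reduction, the loss is averaged over the
# 3 non-pad positions only; the pad positions contribute 0 to the sum.
```

Note that even here, the logit for the pad class (index 3) still enters the softmax denominator at every non-pad position, which is exactly the behavior this issue is about.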
Here is an illustrative example:
```python
import torch

# Test cross entropy loss: first create data
x = torch.log(torch.tensor([[2, 3, 4]], dtype=torch.float))  # a vector of 3-class logits (one of them could be a padding class)
y1 = torch.tensor([0], dtype=torch.long)
y2 = torch.tensor([1], dtype=torch.long)
y3 = torch.tensor([2], dtype=torch.long)

# calculate the negative log-softmax for each logit index for comparison
-torch.nn.functional.log_softmax(x, dim=1)  # returns tensor([[1.5041, 1.0986, 0.8109]])

# perform log-softmax and NLL loss at the same time (not using ignore_index yet)
print(torch.nn.functional.cross_entropy(x, y1))  # 1.5041
print(torch.nn.functional.cross_entropy(x, y2))  # 1.0986
print(torch.nn.functional.cross_entropy(x, y3))  # 0.8109

# Now let's ignore index 0 and compute the cross entropy loss for index 1
print(torch.nn.functional.cross_entropy(x, y2, ignore_index=0))  # 1.0986
# This is the same value as when index 0 is not excluded from the logits;
# it should ignore the index at the level of the logits, not just the final target index.

# Next, calculate the correct cross entropy loss when index 0 is removed
# completely from both x and y
x_ignore = x[:, 1:]  # now we have logits of 2 classes
# indices greater than the ignored index are decreased by 1
y2_ignore = torch.tensor([0], dtype=torch.long)
y3_ignore = torch.tensor([1], dtype=torch.long)
# cross entropy without class 0 for target index 1 (which now becomes index 0)
print(torch.nn.functional.cross_entropy(x_ignore, y2_ignore))  # 0.8473
```
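For completeness, the 0.8473 value can also be reproduced without slicing the logit tensor: setting the ignored class's logit to `-inf` removes it from the softmax normalization. This is a workaround sketch of my own, not something the original post proposes.

```python
import torch
import torch.nn.functional as F

x = torch.log(torch.tensor([[2., 3., 4.]]))
y2 = torch.tensor([1])

# Masking the ignored class's logit with -inf gives it zero probability
# mass, so it no longer contributes to the softmax denominator.
x_masked = x.clone()
x_masked[:, 0] = float('-inf')
loss = F.cross_entropy(x_masked, y2)  # ≈ 0.8473, i.e. -log(3/7)
```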
In conclusion, I'm raising this issue in case the developers want to revise the `ignore_index` option. If the current behavior already is the intended use (ignore only y, not x, and hence allow backprop through the ignored logit index via the normalization term of the softmax), then this is simply my misunderstanding of how it should work (I expected it to ignore the index in both x and y).
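The backprop point above is easy to check directly: even when the target is not the ignored class, the gradient with respect to the "ignored" logit is nonzero, because that logit still sits in the softmax denominator. A small sketch (my own check, using the same numbers as the example):

```python
import torch
import torch.nn.functional as F

x = torch.log(torch.tensor([[2., 3., 4.]])).requires_grad_()
y = torch.tensor([1])

loss = F.cross_entropy(x, y, ignore_index=0)
loss.backward()

# The gradient w.r.t. the logit at the ignored index 0 is
# softmax(x)[0] = 2/9 ≈ 0.2222, not 0: the ignored class still
# participates in the normalization and receives gradient.
print(x.grad)
```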