Does selecting a tensor using a mask produce the same effect as ignore_index in Cross Entropy loss?

I have a model where I need to compute the cross-entropy loss only for certain indices. The node_mask allows for the selection of those indices. There is also an ignore_index argument in the CrossEntropyLoss class. Since I have the ground truth only for these masked labels, I thought it might make sense, in terms of overall space efficiency, to just do the masked selection beforehand.

x = self.linear1(self.node_features)  # W, 200
x = self.linear2(x)  # W, 1024
x = F.relu(x)
x = x[self.node_mask]  # W, 14

However, I am still not sure whether this will have the same effect, or whether I should stick to ignore_index and expand my ground truth to a 1024-dimensional vector with -100 at the non-masked indices.
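
To make the comparison concrete, here is a toy version of the two options I have in mind (the shapes, sizes, and names below are made up and not from my actual model):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)                           # 6 positions, 5 classes (toy sizes)
node_mask = torch.tensor([1, 0, 1, 0, 0, 1]).bool()  # positions I actually have labels for
labels = torch.tensor([2, 4, 0])                     # one label per masked position

# Option 1: masked selection beforehand, loss only over the selected positions
loss_masked = F.cross_entropy(logits[node_mask], labels)

# Option 2: keep everything and use ignore_index, with -100 at the unlabelled positions
full_targets = torch.full((6,), -100, dtype=torch.long)
full_targets[node_mask] = labels
loss_ignored = F.cross_entropy(logits, full_targets, ignore_index=-100)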

Any help would be highly appreciated.

I think this is a very cool question!

CrossEntropyLoss consists of two things, a LogSoftmax followed by an NLLLoss (negative log likelihood loss).
Accordingly, there are two parts:

  • If you select the rows (you could also use x[:, class_mask] btw if you have fixed classes), the implicit (log) softmax will be only over the selected classes, i.e. they will sum (or logsumexp) to 1 (0). If you don’t select them, the other items in the tensor will lower the class (log) softmax of the selected ones (unless they’re all -infinity).

    One comparison to draw might be to the training of word2vec etc. where the negative sampling is closer in spirit to selecting the classes.

    If you wanted as-if behaviour for the tensor without using a mask, you could make a tensor where the not-to-be-considered items are set to -infinity (in place by assigning, or in a new tensor using where, depending on your needs); there is a small sketch after this list.

  • In the NLL loss optimization, only the likelihood of the data, i.e. the probability the model assigns to the true class, is pushed up (by minimizing the negative log likelihood). This means that unless the true class is sometimes one of the indices you want to ignore, it will be all the same regardless of what you do. The key difference is that the labels must be 0…num_classes-1, so depending on what your target data is initially, you may have to relabel for one solution or the other (the sketch below shows this as well).
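
To make both parts concrete, here is a minimal, self-contained sketch (the sizes, the class_mask, and the targets below are made up for illustration). It checks that CrossEntropyLoss is a LogSoftmax followed by an NLLLoss, and that selecting a fixed set of classes gives the same loss as keeping the full logits with the non-selected classes set to -infinity, as long as the targets are relabelled to 0…k-1 for the selected variant:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10                                    # toy sizes: 4 samples, 10 classes
logits = torch.randn(N, C)
targets = torch.tensor([1, 4, 7, 4])            # true classes, all inside the kept set

# Part 1: CrossEntropyLoss = LogSoftmax + NLLLoss
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))                  # True

# Part 2: restricting to some classes -- select the columns ...
class_mask = torch.zeros(C, dtype=torch.bool)
class_mask[[1, 4, 7, 9]] = True                 # the k = 4 classes we care about

k = int(class_mask.sum())
remap = torch.full((C,), -1, dtype=torch.long)  # relabel kept classes to 0..k-1
remap[class_mask] = torch.arange(k)
loss_selected = F.cross_entropy(logits[:, class_mask], remap[targets])

# ... or keep the full logits but push the unwanted classes to -infinity,
# so they contribute nothing to the softmax normalisation (torch.where works too)
filled = logits.masked_fill(~class_mask, float('-inf'))
loss_filled = F.cross_entropy(filled, targets)

print(torch.allclose(loss_selected, loss_filled))  # True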

Best regards

Thomas
