Masking input to loss function

I’m currently implementing pseudo-labeling, where I create labels for the unlabeled part of the dataset by simply running the samples through the model and using its predictions as ground truth. However, I only use the prediction for a sample as ground truth if its confidence surpasses a given threshold.

To implement this, I tried two approaches:

import torch.nn.functional as F

# out: raw logits of the model for a batch of unlabeled samples
conf, pseudo_label = F.softmax(out, dim=1).max(dim=1)
mask = conf > threshold

# Option 1: keep only the confident samples before computing the loss
loss = F.cross_entropy(out[mask], pseudo_label[mask])

# Option 2: compute the per-sample loss everywhere, then zero out the unconfident samples
loss = (F.cross_entropy(out, pseudo_label, reduction='none') * mask).mean()

Which of them is preferable? Option 1 produces NaN values when the mask is False for every item, but Option 2 is more computationally expensive, since it always runs every sample through the loss function, even the ones that end up masked out.
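
One workaround I’ve been considering for the NaN case is to keep Option 1 but explicitly guard against an all-False mask, something like this (just a sketch, not sure it’s the cleanest way):

if mask.any():
    loss = F.cross_entropy(out[mask], pseudo_label[mask])
else:
    # no confident prediction in this batch: use a zero loss that is still
    # attached to the graph, so loss.backward() doesn't fail
    loss = out.sum() * 0.0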

I also tried a few other masking approaches:

import torch

tnsr = torch.tensor([1., 2., 3., 4.])
mask = torch.tensor([False, False, True, True])

a = tnsr * mask                       # gives tensor([0., 0., 3., 4.])
b = torch.masked_select(tnsr, mask)   # gives tensor([3., 4.])
c = tnsr[mask]                        # gives tensor([3., 4.])
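
The difference matters for the loss: multiplying by the mask keeps the zeroed-out entries, so a later .mean() divides by the full batch size, while indexing with the mask drops them before averaging. A quick check with the toy tensors above:

(tnsr * mask).mean()   # tensor(1.7500)  ->  (0 + 0 + 3 + 4) / 4
tnsr[mask].mean()      # tensor(3.5000)  ->  (3 + 4) / 2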

I’m still not sure which of these options to choose.

I would advise you to choose Option 1.

This is the option I’ve always used to train my language models with the Masked Language Modeling objective: once I mask the tokens in my sentence and pass it through the BERT encoder, I use a mask to select the representations of the masked tokens, which I then project to get probabilities.
You can also see this in the original code of XLM (GitHub - facebookresearch/XLM: PyTorch original implementation of Cross-lingual Language Model Pretraining), both in the mask selection and in the loss computation.
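
To give a rough idea of the pattern (just a toy sketch with made-up names like hidden, proj and masked_positions, not the actual XLM code):

import torch
import torch.nn as nn
import torch.nn.functional as F

# toy shapes, just to illustrate the masking pattern
batch, seq_len, dim, vocab = 2, 8, 16, 100
hidden = torch.randn(batch, seq_len, dim)                  # encoder outputs
masked_positions = torch.zeros(batch, seq_len, dtype=torch.bool)
masked_positions[0, 2] = True                              # pretend these two
masked_positions[1, 5] = True                              # tokens were masked
target_ids = torch.randint(vocab, (2,))                    # their gold token ids

proj = nn.Linear(dim, vocab)                               # projection to the vocabulary

selected = hidden[masked_positions]                        # (n_masked, dim): only the masked tokens
logits = proj(selected)                                    # (n_masked, vocab)
loss = F.cross_entropy(logits, target_ids)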