I’m currently implementing pseudo-labeling, where I create labels for the unlabeled part of the dataset by simply running the samples through the model and using the predictions as ground truth. However, I only use a prediction as ground truth for a sample if its confidence surpasses a given threshold.
Which of the two is preferable? Option 1 produces nan values when the mask is False for every item in the batch, but option 2 is much more computationally expensive, since it always runs the whole batch through the loss function before masking.
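For concreteness, here is a minimal sketch of how I understand the two options, with hypothetical function names (`pseudo_label_loss_v1`/`v2` are my own labels, not from any library). Option 1 can be made nan-safe with a guard for the all-False case, which keeps its efficiency advantage:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss_v1(logits, targets, mask):
    # Option 1: select the confident samples first, then run only them
    # through the loss. Cheap, but cross_entropy over an empty selection
    # returns nan, so guard against a batch with no confident samples.
    if not mask.any():
        return logits.new_zeros(())  # contributes nothing to the gradient
    return F.cross_entropy(logits[mask], targets[mask])

def pseudo_label_loss_v2(logits, targets, mask):
    # Option 2: compute the per-sample loss for the whole batch,
    # then zero out the unconfident samples and average the rest.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    denom = mask.sum().clamp(min=1)  # avoid division by zero
    return (per_sample * mask.float()).sum() / denom
```

When the mask is non-empty, both compute the same mean loss over the confident samples; the guarded option 1 avoids both the nan and the wasted forward through the loss.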
This is the approach I’ve always used to train my language models with the Masked Language Modeling objective: once I mask the tokens in my sentence and pass it through the BERT encoder, I use a boolean mask to select the representations of the masked tokens, which I then project to get the output probabilities.
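A minimal sketch of that selection step, assuming a BERT-style encoder output of shape `[batch, seq_len, hidden]` (the tensor names and sizes here are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 16

# Stand-in for the encoder output on a batch of 2 sentences of length 8.
hidden_states = torch.randn(2, 8, hidden)

# True at the positions that were replaced by [MASK] (or randomly corrupted).
mlm_mask = torch.zeros(2, 8, dtype=torch.bool)
mlm_mask[0, 2] = True
mlm_mask[1, 5] = True

# Original token ids at the masked positions, in the same order as the mask.
labels = torch.randint(0, vocab_size, (int(mlm_mask.sum()),))

# Stand-in for the output projection to the vocabulary.
proj = torch.nn.Linear(hidden, vocab_size)

masked_repr = hidden_states[mlm_mask]  # [n_masked, hidden]: only masked tokens
logits = proj(masked_repr)             # [n_masked, vocab_size]
loss = F.cross_entropy(logits, labels)
```

Because only the masked positions are projected, the expensive hidden-to-vocabulary matmul runs on `n_masked` rows instead of `batch * seq_len`.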
You can also see this in the original code of XLM (GitHub - facebookresearch/XLM: PyTorch original implementation of Cross-lingual Language Model Pretraining): here for the mask selection and here for the loss computation.