I’m currently implementing pseudo-labeling, where I create labels for the unlabeled part of the dataset by simply running the samples through the model and using the predictions as ground truth. However, I only use a prediction as ground truth for a sample if its confidence surpasses a given threshold.
Which of the two is preferable? Option 1 produces nan values when the mask is False for every item in the batch, but option 2 is much more computationally expensive, since it always runs the whole batch through the loss function before masking.
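For concreteness, here is a minimal sketch of how I understand the two options, with hypothetical function names (`pseudo_label_loss_v1`/`v2` are my own labels, not from any library). Option 1 can be made nan-safe with a guard for the all-False case, which keeps its efficiency advantage:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss_v1(logits, targets, mask):
    # Option 1: select the confident samples first, then run only them
    # through the loss. Cheap, but cross_entropy over an empty selection
    # returns nan, so guard against a batch with no confident samples.
    if not mask.any():
        return logits.new_zeros(())  # contributes nothing to the gradient
    return F.cross_entropy(logits[mask], targets[mask])

def pseudo_label_loss_v2(logits, targets, mask):
    # Option 2: compute the per-sample loss for the whole batch,
    # then zero out the unconfident samples and average the rest.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    denom = mask.sum().clamp(min=1)  # avoid division by zero
    return (per_sample * mask.float()).sum() / denom
```

When the mask is non-empty, both compute the same mean loss over the confident samples; the guarded option 1 avoids both the nan and the wasted forward through the loss.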
This is the approach I’ve always used to train my language models with the Masked Language Modeling objective: once I mask the tokens in my sentence and pass it through the BERT encoder, I use a boolean mask to select the representations of the masked tokens, which I then project to get the output probabilities.
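A minimal sketch of that selection step, assuming a BERT-style encoder output of shape `[batch, seq_len, hidden]` (the tensor names and sizes here are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 16

# Stand-in for the encoder output on a batch of 2 sentences of length 8.
hidden_states = torch.randn(2, 8, hidden)

# True at the positions that were replaced by [MASK] (or randomly corrupted).
mlm_mask = torch.zeros(2, 8, dtype=torch.bool)
mlm_mask[0, 2] = True
mlm_mask[1, 5] = True

# Original token ids at the masked positions, in the same order as the mask.
labels = torch.randint(0, vocab_size, (int(mlm_mask.sum()),))

# Stand-in for the output projection to the vocabulary.
proj = torch.nn.Linear(hidden, vocab_size)

masked_repr = hidden_states[mlm_mask]  # [n_masked, hidden]: only masked tokens
logits = proj(masked_repr)             # [n_masked, vocab_size]
loss = F.cross_entropy(logits, labels)
```

Because only the masked positions are projected, the expensive hidden-to-vocabulary matmul runs on `n_masked` rows instead of `batch * seq_len`.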
You can also see this in the original code of XLM (GitHub - facebookresearch/XLM: PyTorch original implementation of Cross-lingual Language Model Pretraining): here for the mask selection and here for the loss computation.