Ignore padding area in loss computation

I am working on a sequence-labelling task over small texts. Specifically, I use a BERT model from the Hugging Face library (BertModel in particular), and I tokenize every text with the library's tokenizer before feeding it to the model. Since the texts are small, I have set the tokenizer's output sequence length to 256. My labels are binary (1 and 0), and every sequence element (BERT input token) is assigned a label.

For the loss computation I use binary cross-entropy (BCEWithLogitsLoss), but the function also includes the padding tokens in the loss, which in turn affects backpropagation.

I want BCEWithLogitsLoss to compute the loss only on the actual text tokens, not on the padding tokens. What is the best way to achieve that?

I think you could use the raw per-element loss output (via reduction='none'), set the unwanted loss entries to zero, reduce the loss, and calculate the gradients via loss.backward(). I'm unsure if there is a better way to mask the loss.
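A minimal sketch of that masking approach (the tensor shapes and the 200-token cutoff are made up for illustration; in practice the attention_mask returned by the tokenizer marks the real tokens):

```python
import torch
import torch.nn as nn

# Hypothetical setup: batch of 2 sequences, length 256, one logit per token.
batch_size, seq_len = 2, 256
logits = torch.randn(batch_size, seq_len, requires_grad=True)
labels = torch.randint(0, 2, (batch_size, seq_len)).float()

# In practice this comes from the tokenizer: 1 for real tokens, 0 for padding.
attention_mask = torch.ones(batch_size, seq_len)
attention_mask[:, 200:] = 0.0  # pretend everything after token 200 is padding

criterion = nn.BCEWithLogitsLoss(reduction='none')
raw_loss = criterion(logits, labels)             # per-token loss, shape [2, 256]
masked_loss = raw_loss * attention_mask          # zero out the padding entries
loss = masked_loss.sum() / attention_mask.sum()  # average over real tokens only
loss.backward()                                  # gradients ignore padding
```

After the backward call, the gradient at the padded positions is exactly zero, so the padding no longer influences the parameter updates.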

@ptrblck thank you for your response.

You are right, this is a clean way to implement it.
I tried it, and even though the loss was different, the model metrics did not change.
I suppose that when you said to reduce the loss (since reduction='none'), you meant to use torch.mean() after zeroing out the loss of the padded tokens.

I only mention this because the resulting tensor's grad_fn is Mean, although the backward computation will still include all steps, including BCEWithLogitsLoss.

I also cannot think of a better way to implement this.
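For what it's worth, here is a small sketch of the two reductions being discussed (shapes and mask are made up): calling torch.mean() on the zeroed-out loss divides by all positions including the padded ones, while dividing the sum by the mask count averages only over the real tokens. Within a batch the two differ only by a scale factor, so the gradient direction is the same, which might explain why the metrics barely moved:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 8, requires_grad=True)
labels = torch.randint(0, 2, (1, 8)).float()
mask = torch.tensor([[1., 1., 1., 1., 1., 0., 0., 0.]])  # last 3 are padding

raw = nn.BCEWithLogitsLoss(reduction='none')(logits, labels)
masked = raw * mask

plain_mean = masked.mean()              # divides by all 8 positions
token_mean = masked.sum() / mask.sum()  # divides by the 5 real tokens

# The two differ only by the factor mask.numel() / mask.sum() (here 8/5),
# so the gradient direction is identical; only its magnitude is scaled.
print(plain_mean.grad_fn)  # the Mean node; backward still reaches the BCE op
```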


Yes, that’s what I had in mind. The backward call should still work as intended, and internally the mean reduction would do the same. In your manual approach the grad_fn would point to the mean operation, which shouldn’t be a concern.

That is also what I thought. Thanks again for your help.