Ignore padding area in loss computation

@ptrblck thank you for your response.

You are right, this is a clean way to implement it.
I tried it, and even though the loss value was different, the model metrics did not change.
I suppose that when you said to reduce the loss, since reduction='none', you meant to use torch.mean() after zeroing out the loss of the padded tokens.

I only mentioned it because the resulting tensor's grad_fn is Mean, although the backward pass will still include all steps, including BCEWithLogits.
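
For reference, here is a minimal sketch of how I understood the masking approach; the tensor shapes and the pad_mask variable are just hypothetical placeholders for illustration:

```python
import torch
import torch.nn as nn

# Per-element loss, no reduction yet
criterion = nn.BCEWithLogitsLoss(reduction='none')

# Hypothetical shapes: [batch, seq_len]; pad_mask is 1 for real tokens, 0 for padding
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 2, (4, 10)).float()
pad_mask = torch.ones(4, 10)
pad_mask[:, 7:] = 0  # pretend the last 3 positions are padding

per_token_loss = criterion(logits, targets)   # shape [batch, seq_len]
masked_loss = per_token_loss * pad_mask       # zero out the padded positions
loss = masked_loss.mean()                     # reduce with torch.mean() as discussed

loss.backward()
print(loss.grad_fn)  # shows the Mean node, but BCEWithLogits is still part of the graph
```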

I also cannot think of a better way to implement this.

Regards.