@ptrblck thank you for your response.
You are right, this is a clean way to implement it.
I tried it, and although the loss values were different, the model metrics did not change.
I suppose that when you said to reduce the loss, since `reduction='none'` is used, you meant to call `torch.mean()` after zeroing out the loss of the padded tokens.
I only mention that because the resulting tensor's `grad_fn` is `Mean`, although the backward computation will still go through all steps, including `BCEWithLogits`.
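For clarity, this is roughly what I implemented (a minimal sketch; the shapes and variable names are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

# Per-element loss, so padded positions can be masked out before reducing.
criterion = nn.BCEWithLogitsLoss(reduction='none')

logits = torch.randn(4, 10, requires_grad=True)          # (batch, seq_len), placeholder
targets = torch.randint(0, 2, (4, 10)).float()            # binary targets, placeholder
pad_mask = torch.randint(0, 2, (4, 10)).float()           # 1 = real token, 0 = padding

loss = criterion(logits, targets)   # grad_fn points to the BCEWithLogits backward
loss = loss * pad_mask              # zero out the loss at padded positions
loss = torch.mean(loss)             # final grad_fn is Mean, but backward still runs through BCEWithLogits
loss.backward()
```

Note that `torch.mean()` here averages over all positions, padded ones included; dividing the masked sum by `pad_mask.sum()` instead would average only over the real tokens.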
I also cannot think of a better way to implement this.
Regards.