For `BCEWithLogitsLoss`, we have (assuming the reduction is `'none'`):

$$\ell_n = -\left[\, y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log\bigl(1 - \sigma(x_n)\bigr) \,\right]$$

where $\sigma$ is the sigmoid function.
And for `MultiLabelSoftMarginLoss`, we have:

$$\text{loss}(x, y) = -\frac{1}{C} \sum_i y[i] \cdot \log\!\left(\frac{1}{1 + e^{-x[i]}}\right) + \bigl(1 - y[i]\bigr) \cdot \log\!\left(\frac{e^{-x[i]}}{1 + e^{-x[i]}}\right)$$
And finally, for `KLDivLoss` we have (with the reduction `'none'` and the input $x$ given in log-space):

$$l_n = y_n \cdot \bigl(\log y_n - x_n\bigr)$$
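To make sure I'm reading the definitions right, I checked them by hand against the built-in modules on some arbitrary values (variable names and numbers are just for illustration):

```python
import torch
import torch.nn as nn

# Arbitrary logits x and multi-hot targets y
x = torch.tensor([0.5, -1.0, 2.0])
y = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss, reduction='none':
#   l_n = -[ y_n * log(sigmoid(x_n)) + (1 - y_n) * log(1 - sigmoid(x_n)) ]
sig = torch.sigmoid(x)
manual_bce = -(y * torch.log(sig) + (1 - y) * torch.log(1 - sig))
builtin_bce = nn.BCEWithLogitsLoss(reduction='none')(x, y)

# KLDivLoss, reduction='none':
#   l_n = y_n * (log y_n - x_n), with the input already in log-space
target_probs = torch.tensor([0.2, 0.3, 0.5])
log_input = torch.log_softmax(torch.tensor([0.1, 0.4, 1.0]), dim=0)
manual_kl = target_probs * (torch.log(target_probs) - log_input)
builtin_kl = nn.KLDivLoss(reduction='none')(log_input, target_probs)

print(torch.allclose(manual_bce, builtin_bce))  # True
print(torch.allclose(manual_kl, builtin_kl))    # True
```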
While I understand the first formula (it's cross-entropy) and its use case, I'm not familiar with the other two. Could someone please explain when I should use each of these three loss functions? I've been reading the implementation of The Annotated Transformer, which in my opinion was a good candidate for `BCEWithLogitsLoss`, but the author decided to use `KLDivLoss`. In general, what are the use cases for each of these three loss functions?
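For context, part of what confuses me is that the first two losses appear to agree numerically on multi-hot targets, while `KLDivLoss` takes a whole distribution as the target. A small sketch of what I mean (random data, names are mine):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)                 # 4 samples, 3 independent labels
targets = torch.empty(4, 3).bernoulli_()   # multi-hot 0/1 targets

# Under the default 'mean' reduction, these two coincide (up to numerics)
bce = nn.BCEWithLogitsLoss()(logits, targets)
mlsm = nn.MultiLabelSoftMarginLoss()(logits, targets)
print(torch.allclose(bce, mlsm))  # True

# KLDivLoss instead compares two probability distributions:
# log-probabilities as input, a probability distribution as target
log_probs = torch.log_softmax(logits, dim=-1)
target_dist = torch.softmax(torch.randn(4, 3), dim=-1)
kl = nn.KLDivLoss(reduction='batchmean')(log_probs, target_dist)
print(kl.item() >= 0)  # KL divergence is nonnegative
```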