Sorry if my question is too basic. I’m comparing three loss functions: BCEWithLogitsLoss, MultiLabelSoftMarginLoss, and KLDivLoss. Here are the formulas for each one:

For BCEWithLogitsLoss, we have (assuming the reduction is `mean`):

$$\ell(x, y) = \frac{1}{N} \sum_{n=1}^{N} -\Big[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log\big(1 - \sigma(x_n)\big) \Big]$$

And for MultiLabelSoftMarginLoss, we have:

$$\text{loss}(x, y) = -\frac{1}{C} \sum_{i=1}^{C} \Big[ y[i] \cdot \log \sigma(x[i]) + (1 - y[i]) \cdot \log\big(1 - \sigma(x[i])\big) \Big]$$
And finally, for KLDivLoss we have (with `batchmean` reduction):

$$\ell(x, y) = \frac{1}{B} \sum_{n} y_n \cdot \big(\log y_n - x_n\big)$$

where $x_n$ are log-probabilities and $B$ is the batch size.

While I understand the first formula (it’s cross-entropy) and its use case, I’m not familiar with the other two. Could someone please explain when I should use each of these three loss functions? I’ve been reading the implementation of The Annotated Transformer, which in my opinion was a good candidate for BCEWithLogitsLoss, but the author decided to use KLDivLoss. In general, what are the use cases for each of these three loss functions?
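To make sure I’m reading the formulas correctly, here’s a pure-Python sketch of how I understand each reduction (the helper names are mine, not PyTorch’s — this is just my reading of the formulas above, not the library implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits_mean(logits, targets):
    # Mean over *all* elements of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    terms = [
        -(y * math.log(sigmoid(x)) + (1 - y) * math.log(1 - sigmoid(x)))
        for xs, ys in zip(logits, targets)
        for x, y in zip(xs, ys)
    ]
    return sum(terms) / len(terms)

def multilabel_soft_margin(logits, targets):
    # Mean over the C classes of each sample, then mean over the batch
    per_sample = []
    for xs, ys in zip(logits, targets):
        terms = [
            -(y * math.log(sigmoid(x)) + (1 - y) * math.log(1 - sigmoid(x)))
            for x, y in zip(xs, ys)
        ]
        per_sample.append(sum(terms) / len(terms))
    return sum(per_sample) / len(per_sample)

def kl_div_batchmean(log_probs, target_probs):
    # KLDivLoss expects log-probabilities as input; 'batchmean' sums
    # y * (log y - x) over all elements and divides by the batch size
    total = sum(
        y * (math.log(y) - x)
        for xs, ys in zip(log_probs, target_probs)
        for x, y in zip(xs, ys)
        if y > 0  # 0 * log 0 is taken as 0
    )
    return total / len(log_probs)

logits  = [[0.5, -1.2, 2.0], [-0.3, 0.8, -2.5]]
targets = [[1.0,  0.0, 1.0], [ 0.0, 1.0,  0.0]]

print(bce_with_logits_mean(logits, targets))
print(multilabel_soft_margin(logits, targets))
```

If I’ve read the formulas right, the first two return the exact same value whenever every sample has the same number of classes, which is part of why I’m unsure when to prefer one over the other.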