Sorry if my question is too basic. I’m comparing three loss functions: BCEWithLogitsLoss, MultiLabelSoftMarginLoss, and KLDivLoss. Here are the formulas for each one:

For BCEWithLogitsLoss, we have (assuming the reduction is `mean`):

$$\ell(x, y) = \frac{1}{N} \sum_{n=1}^{N} -\Big[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log\big(1 - \sigma(x_n)\big) \Big]$$

And for MultiLabelSoftMarginLoss, we have:

$$\text{loss}(x, y) = -\frac{1}{C} \sum_{i=1}^{C} \Big[ y[i] \cdot \log \sigma(x[i]) + (1 - y[i]) \cdot \log\big(1 - \sigma(x[i])\big) \Big]$$
And finally, for KLDivLoss we have (with `batchmean` reduction):

$$\ell(x, y) = \frac{1}{B} \sum_{n} y_n \cdot \big(\log y_n - x_n\big)$$

where $x_n$ are log-probabilities and $B$ is the batch size.

While I understand the first formula (it’s cross-entropy) and its use case, I’m not familiar with the other two. Could someone please explain when I should use each of these three loss functions? I’ve been reading the implementation of The Annotated Transformer, which in my opinion was a good candidate for BCEWithLogitsLoss, but the author decided to use KLDivLoss. In general, what are the use cases for each of these three loss functions?
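To make sure I’m reading the formulas correctly, here’s a pure-Python sketch of how I understand each reduction (the helper names are mine, not PyTorch’s — this is just my reading of the formulas above, not the library implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits_mean(logits, targets):
    # Mean over *all* elements of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    terms = [
        -(y * math.log(sigmoid(x)) + (1 - y) * math.log(1 - sigmoid(x)))
        for xs, ys in zip(logits, targets)
        for x, y in zip(xs, ys)
    ]
    return sum(terms) / len(terms)

def multilabel_soft_margin(logits, targets):
    # Mean over the C classes of each sample, then mean over the batch
    per_sample = []
    for xs, ys in zip(logits, targets):
        terms = [
            -(y * math.log(sigmoid(x)) + (1 - y) * math.log(1 - sigmoid(x)))
            for x, y in zip(xs, ys)
        ]
        per_sample.append(sum(terms) / len(terms))
    return sum(per_sample) / len(per_sample)

def kl_div_batchmean(log_probs, target_probs):
    # KLDivLoss expects log-probabilities as input; 'batchmean' sums
    # y * (log y - x) over all elements and divides by the batch size
    total = sum(
        y * (math.log(y) - x)
        for xs, ys in zip(log_probs, target_probs)
        for x, y in zip(xs, ys)
        if y > 0  # 0 * log 0 is taken as 0
    )
    return total / len(log_probs)

logits  = [[0.5, -1.2, 2.0], [-0.3, 0.8, -2.5]]
targets = [[1.0,  0.0, 1.0], [ 0.0, 1.0,  0.0]]

print(bce_with_logits_mean(logits, targets))
print(multilabel_soft_margin(logits, targets))
```

If I’ve read the formulas right, the first two return the exact same value whenever every sample has the same number of classes, which is part of why I’m unsure when to prefer one over the other.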