Label smoothing and categorical loss functions - alternatives?

Label smoothing seems to be an important regularization technique now and an important component of sequence-to-sequence networks.

Implementing label smoothing is fairly simple. It requires, however, one-hot encoded labels to be passed to the cost function (smoothing changes the ones and zeros to slightly different values).
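To illustrate, a minimal sketch of the smoothing step itself (the smoothing value of 0.1 and the helper name are just assumptions for the example):

import torch

def smooth_one_hot(targets, n_classes, smoothing=0.1):
    # targets: LongTensor of shape [batch_size] holding class indices
    # returns: FloatTensor of shape [batch_size, n_classes] with the smoothed distribution
    smoothed = torch.full((targets.size(0), n_classes), smoothing / (n_classes - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return smoothed

# for 3 classes, smoothing=0.1 and a target index of 2 this yields [0.05, 0.05, 0.9]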

Is there any way to implement it in PyTorch? Could I maybe use some different loss function that accepts one-hot vectors, or rewrite nn.functional.cross_entropy so that the gradient can still be derived?

Or are there any other ideas for how to use label smoothing without switching to TensorFlow?


You could use http://pytorch.org/docs/0.3.0/nn.html#torch.nn.BCEWithLogitsLoss for this purpose.

Thank you. So I should just use this function? As the prediction I should pass the softmax output, and as the labels I should pass a smoothed vector like [0.05, 0.05, 0.9], is that correct?

OK, so I understand that I shouldn't use a softmax function; instead my predictions should be the raw output of the last linear layer, and I should pass a smoothed vector like [0.05, 0.05, 0.9] as the target. Moreover, it is OK to use BCEWithLogitsLoss for multi-class classification.

I am looking for confirmation that my ideas are correct.

You are pretty much spot on with your last comment. Raw predictions go into BCEWithLogitsLoss, and you can use it for multi-class classification (the target is the one-hot encoding).
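For example, a rough sketch of this setup (the smoothing value and the tensor shapes are made up for the example):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 3)                    # raw output of the last linear layer
targets = torch.tensor([2, 0, 1, 2])          # hard class indices
smoothed = torch.full((4, 3), 0.05)           # 0.1 smoothing spread over the two other classes
smoothed.scatter_(1, targets.unsqueeze(1), 0.9)

loss = criterion(logits, smoothed)            # no softmax/sigmoid applied beforehand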


Hi, I got a bit confused here, so sorry for asking again. Do we need to apply label smoothing to the target values, or is there no need anymore if we use BCEWithLogitsLoss? In his last question Dawid mentioned that he should pass a smoothed vector as the target, but from your answer it seems like we just need to pass the one-hot encoded (and not smoothed) target vector.
Thanks!

Based on the paper, I think we need to convert the one-hot vector to a smoothed vector and use the original loss. But changing the loss criterion does the same math, I think.

I have a question here, maybe a naive one, but how does this work for multi-class problems?
I believe BCE loss and sigmoid activation are for binary classes only. I will be grateful to learn about this in detail.

-Sidharth

Multi-label classification use cases, where zero, one or multiple classes can be active in each sample, can use nn.BCEWithLogitsLoss as the loss function.
The model output in this case should be [batch_size, nb_classes].

Multi-class classification use cases, where only a single class is active for each sample, would use nn.CrossEntropyLoss.
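As a quick illustration of the two cases (the batch size and class count below are made up):

import torch
import torch.nn as nn

batch_size, nb_classes = 8, 5
output = torch.randn(batch_size, nb_classes)                       # raw logits, [batch_size, nb_classes]

# multi-label: each class can be active independently -> float target of the same shape
multi_label_target = torch.randint(0, 2, (batch_size, nb_classes)).float()
multi_label_loss = nn.BCEWithLogitsLoss()(output, multi_label_target)

# multi-class: exactly one active class per sample -> LongTensor of class indices
multi_class_target = torch.randint(0, nb_classes, (batch_size,))
multi_class_loss = nn.CrossEntropyLoss()(output, multi_class_target)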


So how do I use label smoothing in the case of multi-class classification, with class weights as well?

You could use e.g. nn.KLDivLoss and apply the weighting to the unreduced loss, or use a manual implementation of label smoothing (you should be able to find some posts on this in the forum).
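As one possible (not the only) way to do this, a sketch with nn.KLDivLoss, weighting each sample by the weight of its target class; the class weights and smoothing value below are assumptions, and note that newer PyTorch releases (1.10+) also accept weight together with a label_smoothing argument directly in nn.CrossEntropyLoss:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, smoothing = 3, 0.1
class_weights = torch.tensor([1.0, 2.0, 0.5])      # assumed per-class weights

logits = torch.randn(4, n_classes)
targets = torch.tensor([2, 0, 1, 2])

# build the smoothed target distribution
smoothed = torch.full((targets.size(0), n_classes), smoothing / (n_classes - 1))
smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

# unreduced KL divergence, then weight each sample by its target class weight
kl = nn.KLDivLoss(reduction='none')(F.log_softmax(logits, dim=1), smoothed)   # [batch, n_classes]
loss = (kl.sum(dim=1) * class_weights[targets]).mean()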


Is the answer from here good? @ptrblck I am sort of confused about what I should be using; there are too many options and nothing that seems to be official PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingLoss(nn.Module):
    """
    With label smoothing, the KL divergence between
    q_{smoothed ground truth prob.}(w) and p_{prob. computed by model}(w)
    is minimized.
    """
    def __init__(self, label_smoothing, tgt_vocab_size, ignore_index=-100):
        assert 0.0 < label_smoothing <= 1.0
        self.ignore_index = ignore_index
        super(LabelSmoothingLoss, self).__init__()

        # spread the smoothing mass over all classes except the target
        # and the ignored (e.g. padding) index
        smoothing_value = label_smoothing / (tgt_vocab_size - 2)
        one_hot = torch.full((tgt_vocab_size,), smoothing_value)
        if 0 <= self.ignore_index < tgt_vocab_size:
            one_hot[self.ignore_index] = 0
        self.register_buffer('one_hot', one_hot.unsqueeze(0))

        self.confidence = 1.0 - label_smoothing

    def forward(self, output, target):
        """
        output (FloatTensor): batch_size x n_classes, log-probabilities (e.g. from log_softmax)
        target (LongTensor): batch_size
        """
        # build the smoothed target distribution for each sample
        model_prob = self.one_hot.repeat(target.size(0), 1)
        model_prob.scatter_(1, target.unsqueeze(1), self.confidence)
        # zero out rows whose target should be ignored (e.g. padding positions)
        model_prob.masked_fill_((target == self.ignore_index).unsqueeze(1), 0)

        return F.kl_div(output, model_prob, reduction='sum')

or is something else recommended?
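For what it's worth, the class above expects log-probabilities as its output argument, since F.kl_div interprets the input as log-probabilities. A small usage sketch under that assumption (the vocabulary size and the choice of 0 as the padding index are made up for the example):

import torch
import torch.nn.functional as F

criterion = LabelSmoothingLoss(label_smoothing=0.1, tgt_vocab_size=100, ignore_index=0)

logits = torch.randn(4, 100)                       # raw model output
target = torch.tensor([5, 0, 42, 7])               # index 0 treated as padding here

loss = criterion(F.log_softmax(logits, dim=-1), target)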
