Label smoothing and categorical loss functions - alternatives?

Label smoothing seems to be an important regularization technique now and an important component of sequence-to-sequence networks.

Implementing label smoothing is fairly simple. It requires, however, one-hot encoded labels to be passed to the cost function (smoothing changes the ones and zeros to slightly different values).
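To illustrate, a minimal sketch of the smoothing step itself (the smoothing value of 0.1 and the helper name are just assumptions for the example):

import torch

def smooth_one_hot(targets, n_classes, smoothing=0.1):
    # targets: LongTensor of shape [batch_size] holding class indices
    # returns: FloatTensor of shape [batch_size, n_classes] with the smoothed distribution
    smoothed = torch.full((targets.size(0), n_classes), smoothing / (n_classes - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return smoothed

# for 3 classes, smoothing=0.1 and a target index of 2 this yields [0.05, 0.05, 0.9]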

Is there any way to implement it in PyTorch? Could I maybe use some different loss function that accepts one-hot vectors, or rewrite nn.functional.cross_entropy so that the gradient can still be derived?

Or are there any other ideas for how to use label smoothing without switching to TensorFlow?


You could use http://pytorch.org/docs/0.3.0/nn.html#torch.nn.BCEWithLogitsLoss for this purpose.

Thank you. So I should just use this function? As the prediction I should pass the softmax output, and as the labels I should pass a smoothed vector like [0.05, 0.05, 0.9], is that correct?

OK, so I understand that I shouldn't use a softmax function; instead my predictions should be the raw output of the last linear layer, and I should pass a smoothed vector like [0.05, 0.05, 0.9] as the target. Moreover, it is OK to use BCEWithLogitsLoss for multi-class classification.

I am looking for confirmation that my ideas are correct.

You are pretty much spot on with your last comment. Raw predictions go into BCEWithLogitsLoss, and you can use it for multi-class classification (the target is the one-hot encoding).
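For example, a rough sketch of this setup (the smoothing value and the tensor shapes are made up for the example):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 3)                    # raw output of the last linear layer
targets = torch.tensor([2, 0, 1, 2])          # hard class indices
smoothed = torch.full((4, 3), 0.05)           # 0.1 smoothing spread over the two other classes
smoothed.scatter_(1, targets.unsqueeze(1), 0.9)

loss = criterion(logits, smoothed)            # no softmax/sigmoid applied beforehand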


Hi, I got a bit confused here, so sorry for asking again. Do we need to apply label smoothing to the target values, or is there no need anymore if we use BCEWithLogitsLoss? In his last question Dawid mentioned that he should pass a smoothed vector as the target, but from your answer it seems like we just need to pass the one-hot encoded (and not smoothed) target vector.
Thanks!

Based on the paper, I think we need to convert the one-hot vector to a smoothed vector and use the original loss. But changing the loss criterion does the same math, I think.

I have a question here, maybe a naive one, but how does this work for multi-class problems?
I believe BCE loss and sigmoid activation are for binary classes only. I will be grateful to learn about this in detail.

-Sidharth

Multi-label classification use cases, where zero, one or multiple classes can be active in each sample, can use nn.BCEWithLogitsLoss as the loss function.
The model output in this case should be [batch_size, nb_classes].

Multi-class classification use cases, where only a single class is active for each sample, would use nn.CrossEntropyLoss.
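As a quick illustration of the two cases (the batch size and class count below are made up):

import torch
import torch.nn as nn

batch_size, nb_classes = 8, 5
output = torch.randn(batch_size, nb_classes)                       # raw logits, [batch_size, nb_classes]

# multi-label: each class can be active independently -> float target of the same shape
multi_label_target = torch.randint(0, 2, (batch_size, nb_classes)).float()
multi_label_loss = nn.BCEWithLogitsLoss()(output, multi_label_target)

# multi-class: exactly one active class per sample -> LongTensor of class indices
multi_class_target = torch.randint(0, nb_classes, (batch_size,))
multi_class_loss = nn.CrossEntropyLoss()(output, multi_class_target)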


So how do I use label smoothing in the case of multi-class classification, with class weights as well?

You could use e.g. nn.KLDivLoss and apply the weighting to the unreduced loss, or use a manual implementation of label smoothing (you should be able to find some posts on this in the forum).
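As one possible (not the only) way to do this, a sketch with nn.KLDivLoss, weighting each sample by the weight of its target class; the class weights and smoothing value below are assumptions, and note that newer PyTorch releases (1.10+) also accept weight together with a label_smoothing argument directly in nn.CrossEntropyLoss:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, smoothing = 3, 0.1
class_weights = torch.tensor([1.0, 2.0, 0.5])      # assumed per-class weights

logits = torch.randn(4, n_classes)
targets = torch.tensor([2, 0, 1, 2])

# build the smoothed target distribution
smoothed = torch.full((targets.size(0), n_classes), smoothing / (n_classes - 1))
smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

# unreduced KL divergence, then weight each sample by its target class weight
kl = nn.KLDivLoss(reduction='none')(F.log_softmax(logits, dim=1), smoothed)   # [batch, n_classes]
loss = (kl.sum(dim=1) * class_weights[targets]).mean()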


Is the answer from here good? @ptrblck I am sort of confused about what I should be using; there are too many options and nothing that seems to be official PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingLoss(nn.Module):
    """
    With label smoothing, the KL divergence between
    q_{smoothed ground truth prob.}(w) and p_{prob. computed by model}(w)
    is minimized.
    """
    def __init__(self, label_smoothing, tgt_vocab_size, ignore_index=-100):
        assert 0.0 < label_smoothing <= 1.0
        self.ignore_index = ignore_index
        super(LabelSmoothingLoss, self).__init__()

        # spread the smoothing mass over all classes except the target
        # and the ignored (e.g. padding) index
        smoothing_value = label_smoothing / (tgt_vocab_size - 2)
        one_hot = torch.full((tgt_vocab_size,), smoothing_value)
        if 0 <= self.ignore_index < tgt_vocab_size:
            one_hot[self.ignore_index] = 0
        self.register_buffer('one_hot', one_hot.unsqueeze(0))

        self.confidence = 1.0 - label_smoothing

    def forward(self, output, target):
        """
        output (FloatTensor): batch_size x n_classes, log-probabilities (e.g. from log_softmax)
        target (LongTensor): batch_size
        """
        # build the smoothed target distribution for each sample
        model_prob = self.one_hot.repeat(target.size(0), 1)
        model_prob.scatter_(1, target.unsqueeze(1), self.confidence)
        # zero out rows whose target should be ignored (e.g. padding positions)
        model_prob.masked_fill_((target == self.ignore_index).unsqueeze(1), 0)

        return F.kl_div(output, model_prob, reduction='sum')

or is something else recommended?
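For what it's worth, the class above expects log-probabilities as its output argument, since F.kl_div interprets the input as log-probabilities. A small usage sketch under that assumption (the vocabulary size and the choice of 0 as the padding index are made up for the example):

import torch
import torch.nn.functional as F

criterion = LabelSmoothingLoss(label_smoothing=0.1, tgt_vocab_size=100, ignore_index=0)

logits = torch.randn(4, 100)                       # raw model output
target = torch.tensor([5, 0, 42, 7])               # index 0 treated as padding here

loss = criterion(F.log_softmax(logits, dim=-1), target)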
