Weighted sum of two losses can not reduce to the loss of one component

weedwind · July 6, 2018, 3:17am

Hi,

I created a loss function, which is the weighted sum of two losses:

Loss = a * loss1 + b * loss2

Here loss1 is a CTC loss, loss2 is a KL divergence loss, and a and b are adjustable weights. To verify the loss, I first removed loss2 entirely, so that Loss = loss1, and trained my network. Then I set a = 1 and b = 0, so that Loss = 1 * loss1 + 0 * loss2, and expected the same result as before. However, I got a very different result. Does anyone have suggestions about what could cause the difference?

Thank you so much for your help.

Mike

Jk749 · July 6, 2018, 5:45am

Can you post some sample code?

When you remove loss2, loss=a*loss1; what’s the value of ‘a’ here?

weedwind · July 6, 2018, 6:52am

I have two models: an SI (speaker-independent) model, which is already trained, and an SD (speaker-dependent) model, which is to be learned. At the beginning, they are identical. I want to adapt the SI model to a new speaker by minimizing the CTC loss on SD data, starting from the SI model. Since I do not want to overfit, I add a weighted KLD loss to the CTC loss to prevent the adapted model from drifting too far from the SI model. The weighting factor is self.mu in the loss. The loss code looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from torch.nn import KLDivLoss as KLD
from warp_ctc_pytorch import CTCLoss

class CTC_KLD(nn.Module):
    def __init__(self, mu):
        super(CTC_KLD, self).__init__()
        self.mu = mu  # weight of the KLD term; (1 - mu) weights the CTC term
        self.ctc_loss = CTCLoss(length_average=True)
        self.KLD = KLD(size_average=False)  # sum over all elements

    def forward(self, SI_logits, SD_logits, SD_targets, SD_target_sizes, input_sizes, input_sizes_list):
        # CTC expects (T, N, D); the logits arrive as (N, T, D).
        SD_logits_ctc = torch.transpose(SD_logits, 0, 1).contiguous()
        CTC_loss = self.ctc_loss(SD_logits_ctc, SD_targets, input_sizes, SD_target_sizes).type(torch.cuda.FloatTensor)

        # Packing drops the padded frames; .data has shape (total_frames, D).
        SI_logits = rnn_utils.pack_padded_sequence(SI_logits, input_sizes_list, batch_first=True).data
        SD_logits_KL = rnn_utils.pack_padded_sequence(SD_logits, input_sizes_list, batch_first=True).data

        # Note: after packing, size(0) is the total number of frames, not the number of sequences.
        batch_size = SI_logits.size(0)
        log_probs_SD = F.log_softmax(SD_logits_KL, dim=1)
        probs_SI = F.softmax(SI_logits, dim=1)

        # KLDivLoss takes log-probabilities as input and probabilities as target.
        KLD_loss = self.KLD(log_probs_SD, probs_SI) / batch_size

        loss = (1.0 - self.mu) * CTC_loss + self.mu * KLD_loss
        return loss

In the main script, an input variable x of size (N, T, D), containing N sequences from the new speaker, first goes through both the SI and SD models to obtain SI_logits and SD_logits. I then detach SI_logits from the graph using SI_logits = SI_logits.detach(), since the SI model should not be updated; it only provides the targets for the KLD loss. Finally, I pass SI_logits and SD_logits to the loss function.

In the loss code above, if I write loss = CTC_loss, training works fine. But when I write loss = 1.0 * CTC_loss + 0.0 * KLD_loss (that is, self.mu = 0), the result, measured by word error rate in speech recognition, is very different from simply writing loss = CTC_loss, even though both should be the same loss function (CTC_loss only). Does anyone have any idea why they differ so much?
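One thing worth checking here, offered as a possibility rather than a diagnosis confirmed in this thread: in floating-point arithmetic, multiplying by zero does not cancel a non-finite value. If KLD_loss (or its gradient) ever evaluates to inf or nan, for example because a probability underflows to 0 before a log is taken, then 1.0 * CTC_loss + 0.0 * KLD_loss is nan rather than CTC_loss. The sketch below uses plain Python floats with made-up values; the helper name weighted_sum is hypothetical.

```python
import math

def weighted_sum(a, loss1, b, loss2):
    """Return a * loss1 + b * loss2 using ordinary float arithmetic."""
    return a * loss1 + b * loss2

ctc = 2.5  # made-up stand-in for a finite CTC loss value

# Finite second term: the zero weight removes it exactly.
finite_case = weighted_sum(1.0, ctc, 0.0, 7.0)

# Non-finite second term: 0.0 * inf and 0.0 * nan are both nan,
# so the total is nan even though the weight is zero.
inf_case = weighted_sum(1.0, ctc, 0.0, math.inf)
nan_case = weighted_sum(1.0, ctc, 0.0, math.nan)

print(finite_case, inf_case, nan_case)
```

If this is what happens, printing KLD_loss during training and checking whether it stays finite should reveal it; if it always stays finite, the cause lies elsewhere.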

rm031 · July 24, 2018, 9:20am

Have you figured out a solution? I have the same problem here!

rm031 · July 24, 2018, 9:30am

My code is as follows. When alpha is set to 0 in the first function, I expect training to behave similarly to training with the second function, but I get totally different results! Setting alpha to 0 leads to wrong results. This bothers me a lot.

def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Compute the knowledge-distillation (KD) loss given outputs, labels.
    "Hyperparameters": temperature and alpha
    """

    loss1 = nn.KLDivLoss(size_average=False)(F.log_softmax(outputs/T, dim=1),
                            F.softmax(teacher_outputs/T, dim=1)) * (alpha * T * T)
    loss2 = F.cross_entropy(outputs, labels, size_average=False) * (1. - alpha)

    KD_loss = loss1 + loss2

    return KD_loss / outputs.size(0)

def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Compute the knowledge-distillation (KD) loss given outputs, labels.
    "Hyperparameters": temperature and alpha
    """

    KD_loss = F.cross_entropy(outputs, labels, size_average=False) * (1. - alpha)

    return KD_loss / outputs.size(0)

weedwind · July 24, 2018, 8:41pm

Hi,

I threw away the built-in KLDivLoss and wrote my own KLD instead. I think that in knowledge distillation you can keep only the part of the KLD loss that depends on your student model and drop the rest.

The original KLD loss is:
P_teacher * log(P_teacher / P_student)

I use:

P_teacher * log(1/P_student)

since this is the only term that depends on your student model. In fact, this modified KLD is what most papers use.
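To see why dropping the other part is harmless: summed over classes, the two expressions differ only by the teacher's entropy, which does not depend on the student, so both have the same gradient with respect to the student. A small numeric check in plain Python, with made-up probability vectors and illustrative function names:

```python
import math

def kl_divergence(p_teacher, p_student):
    """sum_i p_teacher[i] * log(p_teacher[i] / p_student[i])"""
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

def soft_cross_entropy(p_teacher, p_student):
    """sum_i p_teacher[i] * log(1 / p_student[i])"""
    return sum(t * math.log(1.0 / s) for t, s in zip(p_teacher, p_student))

def entropy(p):
    """sum_i p[i] * log(1 / p[i])"""
    return sum(x * math.log(1.0 / x) for x in p)

p_teacher = [0.7, 0.2, 0.1]   # made-up teacher distribution
student_a = [0.5, 0.3, 0.2]   # two different made-up student distributions
student_b = [0.6, 0.3, 0.1]

# The gap between the two losses is the teacher entropy for any student.
gap_a = soft_cross_entropy(p_teacher, student_a) - kl_divergence(p_teacher, student_a)
gap_b = soft_cross_entropy(p_teacher, student_b) - kl_divergence(p_teacher, student_b)
print(gap_a, gap_b, entropy(p_teacher))
```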

You can implement this loss as a module, just as you would define a network. I did not run into your problem with the modified KLD.
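A minimal sketch of such a module, assuming PyTorch. The class name SoftTargetCrossEntropy and the choice to average over the first dimension are assumptions for illustration, not the code actually used in this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTargetCrossEntropy(nn.Module):
    """Mean over rows of sum_c P_teacher[c] * log(1 / P_student[c])."""

    def forward(self, student_logits, teacher_logits):
        # The teacher only supplies targets, so no gradient should flow into it.
        teacher_probs = F.softmax(teacher_logits.detach(), dim=1)
        # Working from log_softmax avoids taking the log of a probability
        # that has underflowed to 0.
        student_log_probs = F.log_softmax(student_logits, dim=1)
        return -(teacher_probs * student_log_probs).sum(dim=1).mean()

# Usage with random stand-in logits of shape (rows, classes).
criterion = SoftTargetCrossEntropy()
student_logits = torch.randn(4, 5, requires_grad=True)
teacher_logits = torch.randn(4, 5)
loss = criterion(student_logits, teacher_logits)
loss.backward()
```

Only the student logits receive a gradient here, which matches the intent that the teacher (or SI) model stays fixed.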

rm031 · August 2, 2018, 8:48am

Yes, this works! Thanks.