Kullback-Leibler Divergence loss function giving negative values

Ismail_Elezi · February 27, 2017, 11:00am

Hi! Still playing with PyTorch and this time I was trying to make a neural network work with Kullback-Leibler divergence. As long as I have one-hot targets, I think that the results of it should be identical to the results of a neural network trained with the cross-entropy loss.

For completeness, I am giving the entire code for the neural net (which is the one used for the tutorial):

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool  = nn.MaxPool2d(2,2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.softmax(x)
        return x

net = Net()
net = net.cuda()

try:
    del net    
    net = Net()
    net = net.cuda()    
except NameError:
    net = Net()
    net = net.cuda()

The only change here is that I apply softmax at the end (KL divergence needs its inputs to be probabilities, and softmax achieves exactly that).

Then, I do the training:

criterion = nn.KLDivLoss() # use Kullback-Leibler divergence loss
optimizer = optim.Adam(net.parameters(), lr=3e-4)
number_of_classes = 10

for epoch in range(5): # loop over the dataset multiple times
    
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data            
        labels_one_hot = convert_labels_to_one_hot(labels, number_of_classes) 
        # wrap them in Variable
        inputs, labels = Variable(inputs).cuda(), Variable(labels_one_hot).cuda()
        optimizer.zero_grad()
        
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()        
        optimizer.step()
        
        # print statistics
        running_loss += loss.data[0]
        if i % 200 == 199: # print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch+1, i+1, running_loss / 200))
            running_loss = 0.0
print('Finished Training')

The only change in this part is that I convert labels to one hot labels. I do that with the following function:

def convert_labels_to_one_hot(labels, number_of_classes):
    number_of_observations = labels.size()[0]
    labels_one_hot = torch.zeros(number_of_observations, number_of_classes)
    for i in range(number_of_observations):
        label_value = labels[i]
        labels_one_hot[i, label_value] = 1.0
    return labels_one_hot

Anyway, there is no backprop through this, so it shouldn’t cause problems. In addition, each row of this matrix contains a single 1, with all the other elements being 0, so it is a valid probability distribution.

Now, the weird thing is that the loss function is negative. That just shouldn’t happen, considering that KL divergence should always be a nonnegative number. For 5 epochs, the results of the loss function are:

[1,   200] loss: -0.019
[2,   200] loss: -0.033
[3,   200] loss: -0.036
[4,   200] loss: -0.038
[5,   200] loss: -0.040

Has anyone had similar problems in the past? Thanks in advance!

alexis-jacq · February 27, 2017, 11:12am

Here labels must be the logarithm of a probability distribution, is that what you do?
KLDivLoss will return sum(outputs * (log(outputs) - labels)), so if your labels are 0/1 vectors while the outputs are probabilities smaller than one, you will necessarily get negative values.

Ismail_Elezi · February 27, 2017, 11:23am

Nope. Labels are a one-hot vector, with 1 for the correct label and 0 for all the other entries (this is the very simple case, in order to see if it performs the same way as with cross-entropy loss).

Why do you think that labels should be log-probs? KL(P||Q) requires just P and Q to be valid probability distributions, nothing more. From the documentation of pytorch:

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
As with NLLLoss, the input given is expected to contain log-probabilities, however unlike ClassNLLLoss, input is not restricted to a 2D Tensor, because the criterion is applied element-wise.
This criterion expects a target Tensor of the same size as the input Tensor.

Okay, on the other hand the outputs of the net should be log-probs, and that is not achieved by softmax, but by log-softmax. Changing the line of code to:

x = F.log_softmax(x)

seems to make the loss function positive. Now I need to do some testing.
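
To see why the sign flips, the pointwise expression target * (log(target) - input) from the C source quoted further down the thread can be reproduced in plain Python. This is only a sketch: the logits are made up, and the helper names are hypothetical stand-ins, not PyTorch functions.

```python
import math

def kl_div_pointwise_sum(inputs, targets):
    """Sum of target * (log(target) - input), skipping zero targets."""
    return sum(t * (math.log(t) - x) for x, t in zip(inputs, targets) if t > 0)

def softmax(logits):
    """Plain softmax, shifted by the max for safety."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]   # made-up scores for three classes
one_hot = [1.0, 0.0, 0.0]   # class 0 is the correct one

probs = softmax(logits)
log_probs = [math.log(p) for p in probs]

# Probabilities as input: the loss collapses to -probs[0], which is negative.
loss_with_probs = kl_div_pointwise_sum(probs, one_hot)
# Log-probabilities as input: the loss is -log(probs[0]), which is non-negative.
loss_with_log_probs = kl_div_pointwise_sum(log_probs, one_hot)

print(loss_with_probs, loss_with_log_probs)
```

With a one-hot target the sum reduces to minus the input at the correct class. If the criterion also averaged over all ten entries per sample (which I believe was the default at the time, but have not verified against that release), the reported losses of -0.019 to -0.040 would correspond to a softmax probability of roughly 0.19 to 0.40 on the correct class.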

alexis-jacq · February 27, 2017, 11:27am

Yes, my mistake, I was confused by the names of your variables. In your case, labels must be probabilities and outputs log-probabilities. Now it should work.

Ismail_Elezi · February 27, 2017, 11:29am

Yep, and the results of this cost function are very similar to that of cross entropy. Thanks!
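
A small check of why the two losses agree for one-hot targets: KL(P||Q) equals the cross-entropy of P and Q minus the entropy of P, and a one-hot P has zero entropy. The distributions below are invented for illustration, and the helpers are plain-Python stand-ins rather than PyTorch calls.

```python
import math

def cross_entropy(p, q):
    """-sum p_i * log(q_i), skipping zero-probability targets."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """-sum p_i * log(p_i), skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """sum p_i * log(p_i / q_i), skipping zero-probability targets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.7, 0.2, 0.1]         # hypothetical model output
one_hot = [1.0, 0.0, 0.0]   # hard label
soft = [0.8, 0.1, 0.1]      # soft label, for contrast

gap_one_hot = cross_entropy(one_hot, q) - kl(one_hot, q)  # entropy of one_hot, i.e. 0
gap_soft = cross_entropy(soft, q) - kl(soft, q)           # entropy of soft, positive

print(gap_one_hot, gap_soft)
```

For soft targets the two losses differ by the entropy of the target, which does not depend on the model, so the gradients with respect to the network should still match.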

On a curious note, why should the outputs be log-probabilities? Is that just for numerical reasons or something deeper?

alexis-jacq · February 27, 2017, 12:04pm

I think (but I am not sure, just trying to understand) the reason is the following. If you look at the code behind it (in C):

sum += *target_data > 0 ? *target_data * (log(*target_data) - *input_data) : 0;

If it was *target_data * (log(*target_data) - log(*input_data)) you would have to also make sure that *input_data>0. Then, if it returns 0, you don’t know if the error is coming from target or input.

apaszke · February 27, 2017, 6:21pm

@Ismail_Elezi yes, it improves numerical stability. If you look into how log_softmax is implemented, it’s not a softmax + log, but an alternative formulation.
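
As an illustration of the stability point, here is a sketch comparing log(softmax(x)) with the usual log-sum-exp formulation. I am assuming the alternative formulation is along these lines; this is not taken from the PyTorch source, and both function names are hypothetical.

```python
import math

def naive_log_softmax(logits):
    """softmax followed by log: breaks when a probability underflows to 0."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    """x_i - logsumexp(x), with the max subtracted before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [0.0, -1000.0]   # made-up, deliberately extreme scores

try:
    naive = naive_log_softmax(logits)
except ValueError:
    # exp(-1000) underflows to 0.0, and math.log(0.0) raises ValueError
    naive = None

stable = stable_log_softmax(logits)
print(naive, stable)
```

The naive version loses the second entry entirely, while the log-sum-exp version never takes the log of an underflowed probability and returns a finite value for it.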

ani0075 · November 4, 2017, 6:49pm

@Ismail_Elezi
I am using this code to test the behaviour of KLDivLoss. I am using the same tensor data for my input and target, so I am expecting the loss to be zero.

    rand_data = torch.randn(1,1000)

    a = Variable(rand_data)
    b = Variable(rand_data)

    a_lsm = F.log_softmax(a)
    b_sm = F.softmax(b)
    
    criterion = nn.KLDivLoss()

    loss = criterion(a_lsm,b_sm)

    print(loss)

But when I run it a few times, it gives me very small numbers as outputs (both positive and negative). Can someone tell me if I am making a mistake here?

These are some of my outputs:
Variable containing:
1.00000e-12 *
1.4934
[torch.FloatTensor of size 1]

Variable containing:
1.00000e-11 *
1.8763
[torch.FloatTensor of size 1]

Variable containing:
1.00000e-11 *
-2.3461
[torch.FloatTensor of size 1]

ptrblck · November 6, 2017, 2:15pm

I assume it might be due to a rounding error or maybe the different approach of calculating the log_softmax.
The error is even bigger comparing these two methods:

r = torch.randn(1, 1000).float()
a = r.clone()
b = r.clone()

err1 = a - b
print(torch.sum(err1))
# 0.0

err2 = torch.log(F.softmax(a)) - F.log_softmax(b)
print(torch.sum(err2))
# 1.0e-05 * 3.5763

zack.zcy · November 8, 2017, 10:51am

What if the loss for each sample can be either positive or negative? How do I sum the loss over mini-batches and do backpropagation?

Ismail_Elezi · November 28, 2017, 11:08am

KL cannot be negative. In my case, I had a bug (solved in the first post).

ani0075 · January 19, 2018, 8:48pm

@apaszke
Given two identical tensors, how does one preprocess them using F.softmax and/or F.log_softmax so that, when they are passed to nn.KLDivLoss() (i.e. the input data is the same as the target data), it gives 0 as the result? Can you please provide working code?
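
One possible approach, sketched in plain Python rather than PyTorch: keep the target in log space as well, so that each term becomes exp(log_target) * (log_target - log_input) and the difference is exactly zero when both sides come from the same computation. The helper names are hypothetical. Newer PyTorch releases expose a log_target=True option on nn.KLDivLoss that, as far as I understand, uses this form; check the documentation of your installed version before relying on it.

```python
import math

def log_softmax(logits):
    """Log-sum-exp formulation of log-softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def kl_from_log_target(log_inputs, log_targets):
    """Sum of exp(log_target) * (log_target - log_input)."""
    return sum(math.exp(t) * (t - x) for x, t in zip(log_inputs, log_targets))

logits = [0.3, -1.2, 2.5, 0.0]   # made-up data, used for both sides

log_q = log_softmax(logits)      # plays the role of the input
log_p = log_softmax(logits)      # plays the role of the target

loss = kl_from_log_target(log_q, log_p)
print(loss)
```

When the target goes through softmax and the input through log_softmax, log(target) and the input are computed along different numerical paths, which is one plausible source of the tiny nonzero values reported earlier in the thread.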

coincheung · November 29, 2019, 9:25am

Why can't the KL divergence be negative?

If I am not making a mistake, the formula is: kl = prob_p * log(prob_p / prob_q). Since we are not sure whether (prob_p / prob_q) is greater or smaller than 1, the KL divergence can be either positive or negative depending on the values of prob_p and prob_q.

ani0075 · January 14, 2020, 4:35pm

@coincheung
Check out https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Properties

The important point to note here is that P and Q are probability distributions, so even though the term for a particular point in the sample space (discrete case) can be negative, the summation over all points in the sample space must be non-negative.

Look at https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Basic_example for an example calculation.
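
For a quick numeric check, here is that kind of calculation in plain Python. The distributions are the ones I recall from the Wikipedia example (P binomial with N = 2 and p = 0.4, Q uniform over three outcomes), so treat the specific numbers as an assumption.

```python
import math

p = [9 / 25, 12 / 25, 4 / 25]   # P: binomial with N = 2, p = 0.4
q = [1 / 3, 1 / 3, 1 / 3]       # Q: uniform over three outcomes

# Individual terms p_i * log(p_i / q_i); the sign depends on whether p_i > q_i.
terms = [pi * math.log(pi / qi) for pi, qi in zip(p, q)]
kl_pq = sum(terms)

print(terms)   # the last term is negative because 4/25 < 1/3
print(kl_pq)   # the sum is still positive
```

The third term is negative, yet the total stays above zero, because both P and Q sum to one.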

saba · August 25, 2020, 4:59am

Hi, I am trying to compute the KL distance between two distributions of size 64*9*9, where 64 is the batch size and 9*9 is my data size. It gives me this error:

RuntimeError: bool value of Tensor with more than one value is ambiguous

MSECMBSS=torch.nn.KLDivLoss(DATA1, Data2)

ptrblck · August 25, 2020, 8:50am

The input shapes should work, as seen in this code snippet:

a = torch.log_softmax(torch.randn(64, 81), dim=1)
b = torch.softmax(torch.randn(64, 81), dim=1)
criterion = nn.KLDivLoss()

loss = criterion(a, b)

Could you post an executable code snippet to reproduce this issue?

saba · August 26, 2020, 2:32am

Hi Ptrblck,

I am using EMD distance from “https://neuralnet-pytorch.readthedocs.io/en/latest/_modules/neuralnet_pytorch/metrics.html?highlight=.extensions#”

I get this error for “import neuralnet_pytorch.ext as ext”
ModuleNotFoundError: No module named ‘neuralnet_pytorch.ext’

Would you please help me with that?

Punitha_Valli · May 6, 2021, 11:55am

preds_T[0].detach()
# print (preds_S[0].shape)
# print (preds_T[0].shape)
assert preds_S[0].shape == preds_T[0].shape, 'The output dim of teacher & student differ'
C,W,H = preds_S.shape
#C = 1
#softmax_pred_T = F.softmax(preds_T[0].permute(0,1).contiguous().view(-1,1), dim = 1)
softmax_pred_T = F.softmax(preds_T[0], dim = 1)
#softmax_pred_S = self.LogSoftMax(preds_S[0].permute(0,1).contiguous())
#softmax_pred_S = nn.LogSoftmax(preds_S[0].permute(0,1).contiguous().view(-1,1),dim=1)
softmax_pred_S = nn.LogSoftmax(preds_S[0])
#loss = (torch.sum(-softmax_pred_T * softmax_pred_S ))
#loss = self.KlLoss(softmax_pred_S,softmax_pred_T)
softmax_pred_T1 = F.softmax(preds_T[0].permute(0, 1).contiguous().view(-1, C), dim=1)
logsoftmax1 = nn.LogSoftmax(dim=1)
softmax_pred_S1 = nn.LogSoftmax(preds_S[0].permute(0,1).contiguous().view(-1,C))
loss1 = (torch.sum( - softmax_pred_T1 * logsoftmax1(preds_S[0].permute(0,1).contiguous().view(-1,C))))/W/H

    loss = nn.KLDivLoss(reduction='batchmean').cuda()
    loss_value = loss(softmax_pred_S.dim,softmax_pred_T)

I do the same with softmax and log-softmax, but my loss_value is always negative. Could you please clarify this for me? It would be a great help, thanks in advance. @ptrblck @coincheung @saba @ani0075 @apaszke @alexis-jacq

ptrblck · May 6, 2021, 6:23pm

Based on your code snippet it seems that you are using nn.Modules in the wrong way.
The proper way is to create an object and pass the input(s) to it to get the result.
However, here:

softmax_pred_S = nn.LogSoftmax(preds_S[0])

you are creating the nn.LogSoftmax module with what seems to be a tensor. This value would be used as the dim argument.
Later you are then passing the dim to the loss calculation:

loss_value = loss(softmax_pred_S.dim,softmax_pred_T)

which is also wrong, since a tensor with log probabilities would be expected, not a dimension value.

Roj · May 7, 2021, 8:05am

For the record, proofs that the KL divergence is non-negative assume that both distributions are normalised with respect to the same measure. If however the variables with respect to which the distributions are computed differ (for example, you compare distributions over two layers with different dimensions), then it is possible to obtain a negative result. There’s nothing pathological about this per se, it just means you need to renormalise your target distribution.

Unsure if this is what’s causing your problem, but it’s worth keeping this technical caveat in mind, since it can arise in some applications.
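
One simple way this can show up, sketched in plain Python with made-up numbers: a target that does not sum to one. This is a simpler situation than distributions defined over different variables, so take it only as an illustration of the renormalisation step.

```python
import math

def kl_sum(p, q):
    """sum p_i * log(p_i / q_i), skipping zero entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.5, 0.5]                 # properly normalised
p_unnormalised = [0.2, 0.2]    # sums to 0.4, not a valid distribution

total = sum(p_unnormalised)
p_renormalised = [pi / total for pi in p_unnormalised]

before = kl_sum(p_unnormalised, q)   # negative
after = kl_sum(p_renormalised, q)    # zero, since p now matches q

print(before, after)
```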