Kullback-Leibler Divergence loss function giving negative values

Hi! Still playing with PyTorch and this time I was trying to make a neural network work with Kullback-Leibler divergence. As long as I have one-hot targets, I think that the results of it should be identical to the results of a neural network trained with the cross-entropy loss.

For completeness, I am giving the entire code for the neural net (which is the one used for the tutorial):

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool  = nn.MaxPool2d(2,2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.softmax(x)
        return x

net = Net()
net = net.cuda()

try:
    del net    
    net = Net()
    net = net.cuda()    
except NameError:
    net = Net()
    net = net.cuda()

The only change here is that, at the end, I apply softmax (KL divergence needs the data to be probabilities, and softmax achieves exactly that).

Then, I do the training:

criterion = nn.KLDivLoss() # use Kullback-Leibler divergence loss
optimizer = optim.Adam(net.parameters(), lr=3e-4)
number_of_classes = 10

for epoch in range(5): # loop over the dataset multiple times
    
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data            
        labels_one_hot = convert_labels_to_one_hot(labels, number_of_classes) 
        # wrap them in Variable
        inputs, labels = Variable(inputs).cuda(), Variable(labels_one_hot).cuda()
        optimizer.zero_grad()
        
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()        
        optimizer.step()
        
        # print statistics
        running_loss += loss.data[0]
        if i % 200 == 199: # print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch+1, i+1, running_loss / 200))
            running_loss = 0.0
print('Finished Training')

The only change in this part is that I convert labels to one hot labels. I do that with the following function:

def convert_labels_to_one_hot(labels, number_of_classes):
    number_of_observations = labels.size()[0]
    labels_one_hot = torch.zeros(number_of_observations, number_of_classes)
    for i in range(number_of_observations):
        label_value = labels[i]
        labels_one_hot[i, label_value] = 1.0
    return labels_one_hot 

Anyway, there is no backprop through this, so it shouldn’t cause problems. In addition, each row of this matrix contains a single 1, with all the other elements being 0, so each row is a valid probability distribution.
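
For reference, the same conversion can be done without a Python loop. This is only a minimal sketch, assuming a recent PyTorch release where `torch.nn.functional.one_hot` is available; the sample labels and the class count of 4 are made up for illustration.

```python
import torch
import torch.nn.functional as F

def convert_labels_to_one_hot(labels, number_of_classes):
    # labels: 1-D integer tensor of class indices, shape (N,)
    # returns a float tensor of shape (N, number_of_classes)
    return F.one_hot(labels.long(), num_classes=number_of_classes).float()

# Illustrative labels only (not taken from the dataset in the post)
labels = torch.tensor([2, 0, 1])
one_hot = convert_labels_to_one_hot(labels, 4)
print(one_hot)
```

Each row of `one_hot` sums to 1, so it can be used directly as the target distribution.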

Now, the weird thing is that the loss function is negative. That just shouldn’t happen, considering that KL divergence should always be a nonnegative number. For 5 epochs, the results of the loss function are:

[1,   200] loss: -0.019
[2,   200] loss: -0.033
[3,   200] loss: -0.036
[4,   200] loss: -0.038
[5,   200] loss: -0.040

Anyone had similar problems in the past? Thanks in advance!

Here labels must be the logarithm of a probability distribution; is that what you are doing?
KLDivLoss will return sum(outputs * (log(outputs) - labels)), so if your labels are 0/1 vectors while outputs are probabilities smaller than one, you will necessarily get negative values.

Nope. Labels are a one-hot vector, with 1 for the correct label and 0 for all the other entries (this is the very simple case, in order to see if it performs the same way as with cross-entropy loss).

Why do you think that labels should be log-probs? KL(P||Q) requires just P and Q to be valid probability distributions, nothing more. From the PyTorch documentation:

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.
As with NLLLoss, the input given is expected to contain log-probabilities, however unlike ClassNLLLoss, input is not restricted to a 2D Tensor, because the criterion is applied element-wise.
This criterion expects a target Tensor of the same size as the input Tensor.

Okay, on the other hand, the outputs of the net should be log-probs, and that is achieved not by softmax but by log-softmax. Changing the line of code to:

x = F.log_softmax(x)

seems to make the loss function positive. Now need to do some testing.

Yes, my mistake, I was confused by the names of your variables. In your case, labels must be probabilities and outputs log-probabilities. Now it should work.
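
To make the sign issue concrete, here is a small sketch, assuming a recent PyTorch version (explicit `dim` argument, `reduction="batchmean"`); the random logits and class indices are illustrative. With one-hot targets, each non-zero term is `1 * (log(1) - input)`, so feeding probabilities gives `-p` (negative) while feeding log-probabilities gives `-log(p)` (positive).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)  # illustrative batch of raw network outputs
targets = F.one_hot(torch.tensor([3, 1, 7, 0]), num_classes=10).float()

criterion = nn.KLDivLoss(reduction="batchmean")

# Wrong: input holds probabilities, so the loss becomes negative
loss_wrong = criterion(F.softmax(logits, dim=1), targets)

# Right: input holds log-probabilities, so the loss is non-negative
loss_right = criterion(F.log_softmax(logits, dim=1), targets)

print(loss_wrong.item(), loss_right.item())
```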

Yep, and the results of this cost function are very similar to that of cross entropy. Thanks!
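
A quick way to check the agreement with cross-entropy, sketched under the assumption of a recent PyTorch version: with one-hot targets the target entropy is zero, so KL divergence averaged over the batch should coincide with the mean cross-entropy. The logits and class indices below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # illustrative raw network outputs
class_idx = torch.tensor([3, 1, 7, 0])  # illustrative integer labels
one_hot = F.one_hot(class_idx, num_classes=10).float()

# KL divergence on log-probabilities, summed and divided by the batch size
kl = nn.KLDivLoss(reduction="batchmean")(F.log_softmax(logits, dim=1), one_hot)

# Cross-entropy takes raw logits and integer labels, averaged over the batch
ce = nn.CrossEntropyLoss()(logits, class_idx)

print(kl.item(), ce.item())
```

Note that `reduction="batchmean"` is used here so that both losses are averaged per sample; an element-wise mean would differ by a constant factor equal to the number of classes.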

On a curious note, why should the outputs be log-probabilities? Is that just for numerical reasons or something deeper?

I think (but I am not sure, just trying to understand) the reason is the following. If you look at the underlying code (in C):

sum += *target_data > 0 ? *target_data * (log(*target_data) - *input_data) : 0;

If it was *target_data * (log(*target_data) - log(*input_data)) you would have to also make sure that *input_data>0. Then, if it returns 0, you don’t know if the error is coming from target or input.

@Ismail_Elezi yes, it improves numerical stability. If you look into how log_softmax is implemented, it’s not a softmax + log, but an alternative formulation.
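
A small sketch of the stability difference, assuming a recent PyTorch version; the extreme logits are chosen only to force underflow. Taking the log after softmax loses the small probability entirely, while `log_softmax` keeps a finite value.

```python
import torch
import torch.nn.functional as F

# Illustrative logits with a very large gap between the two classes
x = torch.tensor([[1000.0, 0.0]])

# softmax underflows the second probability to 0, so its log is -inf
naive = torch.log(F.softmax(x, dim=1))

# log_softmax works in log space and returns a finite value instead
stable = F.log_softmax(x, dim=1)

print(naive)
print(stable)
```

A `-inf` in the input of KLDivLoss would propagate into the loss and its gradients, which is one practical reason to ask for log-probabilities directly.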

@Ismail_Elezi
I am using this code to test the behaviour of KLDivLoss. I am using the same tensor data for my input and target, so I am expecting the loss to be zero.

    rand_data = torch.randn(1,1000)

    a = Variable(rand_data)
    b = Variable(rand_data)

    a_lsm = F.log_softmax(a)
    b_sm = F.softmax(b)
    
    criterion = nn.KLDivLoss()

    loss = criterion(a_lsm,b_sm)

    print(loss)

But when I run it a few times, it gives me very small numbers as outputs (both positive and negative). Can someone tell me if I am making a mistake here?

These are some of my outputs:
Variable containing:
1.00000e-12 *
1.4934
[torch.FloatTensor of size 1]

Variable containing:
1.00000e-11 *
1.8763
[torch.FloatTensor of size 1]

Variable containing:
1.00000e-11 *
-2.3461
[torch.FloatTensor of size 1]

I assume it might be due to a rounding error or maybe the different approach of calculating the log_softmax.
The error is even bigger comparing these two methods:

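r = torch.randn(1, 1000).float()
a = r.clone()
b = r.clone()

err1 = a - b
print(torch.sum(err1))
# >> 0.0

# log(softmax(x)) and log_softmax(x) are computed differently, so they differ by rounding
err2 = torch.log(F.softmax(a, dim=1)) - F.log_softmax(b, dim=1)
print(torch.sum(err2))
# >> 1.0e-05*3.5763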

What if, for each sample, the loss can be either positive or negative? How do I sum the loss over mini-batches and do backpropagation?

KL cannot be negative. In my case, I had a bug (solved in the first post).

@apaszke
Given two identical tensors, how does one preprocess them using F.softmax and/or F.log_softmax so that, when they are passed to nn.KLDivLoss() (i.e. the input data is the same as the target data), the result is 0? Can you please provide working code?
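
One possible sketch, not an official answer, assuming a recent PyTorch version: pass `log_softmax` of the tensor as the input and `softmax` of the same tensor as the target. An exact 0 is not guaranteed in floating point, so the result is best compared against a small tolerance; double precision (an assumption made here) shrinks the residual further.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1000, dtype=torch.float64)  # same data for input and target

input_log_probs = F.log_softmax(x, dim=1)  # input: log-probabilities
target_probs = F.softmax(x, dim=1)         # target: probabilities

loss = nn.KLDivLoss(reduction="batchmean")(input_log_probs, target_probs)

# Zero up to rounding error; the sign of the residual is not meaningful
print(loss.item())
```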

Why can't KL-div become negative?

If I am not making a mistake, the formula is: kl = prob_p * log(prob_p / prob_q). Since we are not sure whether prob_p / prob_q is greater or smaller than 1, the kl-div can be both positive and negative depending on the values of prob_p and prob_q.

@coincheung
Check out https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Properties

The important point to note here is that P and Q are probability distributions, so even though the term for a particular point in the sample space (discrete case) can be negative, the sum over all points in the sample space must be non-negative.

Look at https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Basic_example for an example calculation.
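
The calculation can also be sketched in a few lines of plain Python. The two distributions below are the ones I recall from that example (P binomial with n = 2 and p = 0.4, Q uniform over three outcomes); treat them as illustrative values.

```python
import math

# Illustrative distributions over three outcomes
p = [9 / 25, 12 / 25, 4 / 25]
q = [1 / 3, 1 / 3, 1 / 3]

# Individual terms p_i * log(p_i / q_i) may be negative ...
terms = [pi * math.log(pi / qi) for pi, qi in zip(p, q)]

# ... but their sum, the KL divergence KL(P||Q), is non-negative
kl = sum(terms)

print(terms)
print(kl)
```

Here the third term is negative because p_i < q_i for that outcome, yet the total stays positive.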