Hi! Still playing with PyTorch, and this time I was trying to make a neural network work with Kullback-Leibler divergence. As long as I have one-hot targets, I think the results should be identical to those of a network trained with the cross-entropy loss.
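
As a quick sanity check of that claim, here is a minimal sketch in plain Python (no PyTorch involved). The function names and the example numbers are made up for illustration. For a one-hot target the entropy of the target is zero, so the KL divergence and the cross-entropy should come out as the same number:

```python
import math


def cross_entropy(target, predicted):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(target, predicted) if p > 0)


def kl_divergence(target, predicted):
    """KL(p || q) = sum_i p_i * (log(p_i) - log(q_i)), taking 0 * log(0) as 0."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(target, predicted) if p > 0)


one_hot = [0.0, 1.0, 0.0]      # hypothetical target: class 1 out of 3
predicted = [0.2, 0.7, 0.1]    # hypothetical softmax output

ce = cross_entropy(one_hot, predicted)
kl = kl_divergence(one_hot, predicted)
print(ce, kl)  # both equal -log(0.7)
```

With a soft (non-one-hot) target the two values would differ by the entropy of the target, so the equivalence only holds in the one-hot case.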

For completeness, I am giving the entire code for the neural net (which is the one used for the tutorial):

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)  # flatten the feature maps
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.softmax(x)  # added: turn the logits into probabilities
        return x


net = Net()
net = net.cuda()

# discard any existing network and start from a fresh one
try:
    del net
    net = Net()
    net = net.cuda()
except NameError:
    net = Net()
    net = net.cuda()
```

The only change here is that, at the end, I apply softmax (KL divergence needs its inputs to be probabilities, and softmax provides exactly that).

Then, I do the training:

```
import torch.optim as optim
from torch.autograd import Variable

criterion = nn.KLDivLoss()  # use Kullback-Leibler divergence loss
optimizer = optim.Adam(net.parameters(), lr=3e-4)
number_of_classes = 10

for epoch in range(5):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        labels_one_hot = convert_labels_to_one_hot(labels, number_of_classes)

        # wrap them in Variable
        inputs, labels = Variable(inputs).cuda(), Variable(labels_one_hot).cuda()

        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.data[0]
        if i % 200 == 199:  # print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')
```

The only change in this part is that I convert the labels to one-hot labels. I do that with the following function:

```
def convert_labels_to_one_hot(labels, number_of_classes):
    number_of_observations = labels.size()[0]
    # one row per observation, one column per class
    labels_one_hot = torch.zeros(number_of_observations, number_of_classes)
    for i in range(number_of_observations):
        label_value = labels[i]
        labels_one_hot[i, label_value] = 1.0
    return labels_one_hot
```

Anyway, no backprop goes through this function, so it shouldn't cause problems. In addition, each row of this matrix contains a single 1, with all the other elements being 0, so each row is a valid probability distribution.
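
To check the "valid probability distribution" part without tensors, the helper above can be mirrored on plain Python lists. This is only a sketch: `one_hot_rows` and the example labels are hypothetical stand-ins, not part of the training code.

```python
def one_hot_rows(labels, number_of_classes):
    """Plain-list analogue of the one-hot conversion: one row per label."""
    rows = []
    for label in labels:
        row = [0.0] * number_of_classes
        row[label] = 1.0
        rows.append(row)
    return rows


rows = one_hot_rows([3, 0, 9], 10)  # hypothetical labels for 10 classes

# each row must be nonnegative and sum to exactly 1
is_valid = all(sum(row) == 1.0 and min(row) >= 0.0 for row in rows)
print(is_valid)
```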

Now, the weird thing is that the loss function is negative. That just shouldn't happen, considering that KL divergence should always be a nonnegative number. For 5 epochs, the results of the loss function are:
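
For reference, here is a small plain-Python sketch of when such an expression can and cannot go negative. It assumes the pointwise form `target * (log(target) - input)` that the `nn.KLDivLoss` documentation describes, where `input` is expected to contain log-probabilities. The function name and the numbers are made up for illustration, and whether this is what actually happens in the training loop above is a guess that would need checking:

```python
import math


def kl_pointwise_sum(input_values, target):
    """Sum of target * (log(target) - input), taking 0 * log(0) as 0."""
    return sum(t * (math.log(t) - x)
               for x, t in zip(input_values, target) if t > 0)


target = [0.0, 1.0, 0.0]                   # hypothetical one-hot target
probs = [0.2, 0.7, 0.1]                    # hypothetical softmax output
log_probs = [math.log(p) for p in probs]   # what a log-softmax would give

# input holds log-probabilities: a proper KL divergence, never negative
with_log_probs = kl_pointwise_sum(log_probs, target)

# input holds raw probabilities: same formula, but no longer a KL divergence
with_raw_probs = kl_pointwise_sum(probs, target)

print(with_log_probs, with_raw_probs)
```

In this toy case the first value is `-log(0.7)`, which is positive, while the second is `-0.7`, so the sign flips purely because of what is fed in as `input`.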

```
[1, 200] loss: -0.019
[2, 200] loss: -0.033
[3, 200] loss: -0.036
[4, 200] loss: -0.038
[5, 200] loss: -0.040
```

Has anyone had similar problems in the past? Thanks in advance!