Hi! Still playing with PyTorch, and this time I am trying to train a neural network with the Kullback-Leibler divergence loss. As long as the targets are one-hot, I think the results should be identical to those of a network trained with the cross-entropy loss.
For completeness, I am giving the entire code for the neural net (it is the one from the tutorial):
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.softmax(x)
        return x

net = Net()
net = net.cuda()

try:
    del net
    net = Net()
    net = net.cuda()
except NameError:
    net = Net()
    net = net.cuda()
The only change here is that at the end I apply softmax (KL divergence needs the data to be probability distributions, and softmax achieves exactly that).
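Just to sanity-check that claim, here is the quick check I ran (the dummy_input name and shape are just my own, not part of the tutorial) to confirm that each output row of the network really is a probability distribution, i.e. sums to 1:

# my own sanity check, not from the tutorial
import torch
from torch.autograd import Variable

dummy_input = Variable(torch.randn(4, 3, 32, 32)).cuda()  # a fake CIFAR-sized batch of 4 images
probs = net(dummy_input)
print(probs.sum(1))  # every row should sum to (very nearly) 1.0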
Then, I do the training:
criterion = nn.KLDivLoss()  # use Kullback-Leibler divergence loss
optimizer = optim.Adam(net.parameters(), lr=3e-4)
number_of_classes = 10

for epoch in range(5):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        labels_one_hot = convert_labels_to_one_hot(labels, number_of_classes)

        # wrap them in Variable
        inputs, labels = Variable(inputs).cuda(), Variable(labels_one_hot).cuda()

        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.data
        if i % 200 == 199:  # print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')
The only change in this part is that I convert the labels to one-hot labels. I do that with the following function:
def convert_labels_to_one_hot(labels, number_of_classes):
    # build an (N, C) matrix with a 1.0 at each observation's label index
    number_of_observations = labels.size(0)
    labels_one_hot = torch.zeros(number_of_observations, number_of_classes)
    for i in xrange(number_of_observations):
        label_value = labels[i]
        labels_one_hot[i, label_value] = 1.0
    return labels_one_hot
Anyway, there is no backprop through this, so it shouldn't cause problems. In addition, each row of this matrix contains a single 1, with all other elements being 0, so each row is a valid probability distribution.
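For example (a tiny check of my own, with made-up label values), the conversion gives exactly what I expect:

# my own quick check of the one-hot conversion
labels = torch.LongTensor([3, 0, 7])
one_hot = convert_labels_to_one_hot(labels, 10)
print(one_hot)         # each row has a single 1.0 at the label's index, 0.0 elsewhere
print(one_hot.sum(1))  # -> 1.0 for every row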
Now, the weird thing is that the loss is negative. That just shouldn't happen, since the KL divergence is always nonnegative. Over 5 epochs, the reported losses are:
[1, 200] loss: -0.019
[2, 200] loss: -0.033
[3, 200] loss: -0.036
[4, 200] loss: -0.038
[5, 200] loss: -0.040
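Just to convince myself that the true KL divergence really can't be negative here, I computed it by hand for a single example (a small sketch of my own, not using nn.KLDivLoss, with made-up prediction values). With a one-hot target it reduces to minus the log of the predicted probability of the true class, which is always nonnegative:

# manual KL divergence for one example (my own sanity check, not nn.KLDivLoss)
# KL(target || prediction) = sum_i target_i * (log target_i - log prediction_i)
import torch

prediction = torch.Tensor([0.1, 0.7, 0.2])  # a made-up softmax output over 3 classes
target = torch.Tensor([0.0, 1.0, 0.0])      # one-hot target for class 1

# small epsilon keeps log(0) in the target term finite; 0 * log(eps) is still 0
kl = (target * (torch.log(target + 1e-12) - torch.log(prediction))).sum()
print(kl)  # ~0.357, i.e. -log(0.7); never negative since the predicted probability is <= 1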
Anyone had similar problems in the past? Thanks in advance!