Comparison of nn.CrossEntropyLoss with a custom-written cross-entropy loss?

Hi,

I was just experimenting with PyTorch. I implemented a cross-entropy loss function and a softmax function as below:

import torch

def xent(z, y):
    # to_one_hot converts a NumPy 1D array of labels into a one-hot encoded 2D array
    y = torch.Tensor(to_one_hot(y, 3))
    y_hat = pt_softmax(z)
    loss = -y * torch.log(y_hat)   # element-wise cross-entropy terms
    loss = loss.mean()             # mean over all elements of the matrix
    return loss

def pt_softmax(x):
    # subtract the row-wise max before exponentiating, for numerical stability
    exps = torch.exp(x - torch.max(x, dim=1)[0].unsqueeze(1))
    return exps / torch.sum(exps, dim=1).unsqueeze(1)
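For reference, here is a quick way to sanity-check pt_softmax against PyTorch's built-in softmax (a sketch using random inputs):

import torch
import torch.nn.functional as F

x = torch.randn(4, 3)   # random logits: 4 samples, 3 classes
print(torch.allclose(pt_softmax(x), F.softmax(x, dim=1)))   # expected: True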

I compared this loss with nn.CrossEntropyLoss and found that nn.CrossEntropyLoss converges faster on the Wine dataset from the UCI repository. The weights and gradients obtained after each epoch were also different for the two losses. I am using batch gradient descent and not getting any NaN values.

Can anyone please let me know why this is happening? Is it because of a numerically unstable implementation of xent, or due to some other reason?

If you test with a given z and y, you should get very similar values from your version and from the official one.
You should check that this is the case first.
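Something along these lines would do it (a sketch; it assumes your xent, pt_softmax, and to_one_hot helpers are in scope, and uses random inputs):

import numpy as np
import torch
import torch.nn as nn

z = torch.randn(5, 3)                   # logits: 5 samples, 3 classes
y = np.random.randint(0, 3, size=5)     # integer labels, as your to_one_hot expects

print(xent(z, y))                                    # custom loss
print(nn.CrossEntropyLoss()(z, torch.tensor(y)))     # built-in; takes class indices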

Then, if that is true and the training behavior is still different, you can try different random seeds. Sometimes the random initialization can have a significant impact on the network's speed of convergence.
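For instance (a minimal sketch; the seed value is arbitrary):

import torch
torch.manual_seed(42)   # fix PyTorch's RNG so the weight initialization is reproducible across runs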

Thanks @albanD for replying. You were right: after checking, I found my implementation of xent was wrong. With the corrected version below, the values are exactly the same for the same z and y. Here is the correct implementation:

def xent(z, y):
    y = torch.Tensor(to_one_hot(y, 3))
    y_hat = pt_softmax(z)
    loss = -y * torch.log(y_hat)
    loss = loss.sum() / y.shape[0]   # sum over all elements, then divide by batch size
    return loss

Previously I was taking the mean over all the elements, but I should sum the elements first and then average with respect to the batch size. The mean over all elements divides by an extra factor of num_classes, which shrinks the loss and its gradients and explains the slower convergence I saw.
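To see the difference concretely, here is a small numeric sketch with made-up per-element loss values (one nonzero entry per row, as with one-hot targets):

import torch

loss_matrix = torch.tensor([[0.2, 0.0, 0.0],
                            [0.0, 1.5, 0.0]])   # hypothetical -y*log(y_hat), batch of 2, 3 classes
print(loss_matrix.mean())        # 1.7 / (2 * 3) = 0.2833 (old, wrong)
print(loss_matrix.sum() / 2)     # 1.7 / 2 = 0.85 (new, matches nn.CrossEntropyLoss)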
