Hi,

I was experimenting with PyTorch. I implemented a cross-entropy loss function and a softmax function as below:

```
import torch

def xent(z, y):
    # to_one_hot converts a NumPy 1D array to a one-hot-encoded 2D array
    y = torch.Tensor(to_one_hot(y, 3))
    y_hat = pt_softmax(z)
    loss = -y * torch.log(y_hat)
    loss = loss.mean()
    return loss

def pt_softmax(x):
    # numerically stable softmax: subtract the row-wise max before exponentiating
    exps = torch.exp(x - torch.max(x, dim=1)[0].unsqueeze(1))
    return exps / torch.sum(exps, dim=1).unsqueeze(1)
```

I compared this loss with `nn.CrossEntropyLoss` and found that `nn.CrossEntropyLoss` converges faster on the Wine dataset from the UCI repository. Also, the weights and gradients obtained after each epoch were different for the two losses. I am using batch gradient descent and not getting any NaN values.
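To narrow down where the two losses could diverge, here is a torch-free NumPy sketch with made-up data (random logits, 4 samples, 3 classes) that contrasts the two reduction styles: `xent` takes the mean over *all* elements of the `batch × classes` matrix, while the standard cross-entropy reduction sums over classes per sample and then averages over the batch. This is only an illustration of the comparison, not my actual training code:

```python
import numpy as np

# Hypothetical data: 4 samples, 3 classes, random logits and integer labels
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 0])

def softmax(x):
    # same stabilized softmax as pt_softmax, in NumPy
    exps = np.exp(x - x.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

y_hat = softmax(z)
one_hot = np.eye(3)[labels]

# xent-style reduction: mean over ALL elements (divides by batch * classes)
loss_elementwise_mean = (-one_hot * np.log(y_hat)).mean()

# standard reduction: sum over classes per sample, then mean over the batch
loss_per_sample_mean = (-one_hot * np.log(y_hat)).sum(axis=1).mean()

# With one-hot targets, the two differ exactly by the number of classes
print(loss_per_sample_mean / loss_elementwise_mean)
```

Since the one-hot target zeroes out every class except the true one, the element-wise mean is the per-sample mean divided by the number of classes, which rescales the gradients by the same factor.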

Can anyone please let me know why this is happening? Is it because my `xent` implementation is numerically unstable, or is there some other reason?