import torch

# convert class IDs to a one-hot encoding
one_hot = torch.zeros_like(y_pred)
one_hot[range(y_target.shape[0]), y_target] = 1.0
# cross entropy: -one_hot * log_softmax(y_pred), summed over classes, averaged over the batch
ce = -one_hot * (y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()
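For context, this manual computation should agree with PyTorch's built-in loss; a minimal sanity check with made-up shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y_pred = torch.randn(4, 3)             # raw scores, batch of 4, 3 classes
y_target = torch.tensor([0, 2, 1, 2])  # target class indices

# manual CE: pick -log_softmax at the target class, average over the batch
one_hot = torch.zeros_like(y_pred)
one_hot[range(y_target.shape[0]), y_target] = 1.0
ce = -one_hot * (y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()

assert torch.allclose(result, F.cross_entropy(y_pred, y_target))
```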

Can someone help me write the derivative of this with respect to y_pred in Python code? I just cannot figure it out… What is the backward of this loss function?

I see. It’s probably better to write the equations here in mathematical terms, show us where you get stuck while applying the chain rule, and we can take it from there.
If you’re having difficulty with the CE derivative, then you’ll easily find some blogs and articles discussing its derivative, like:

The question is still whether you’re struggling with the math or the implementation.
Also note that CE loss is usually used in conjunction with a softmax layer because the derivative of softmax+CE is really nice and easy to implement.

So, I built my own backward for linear layers and activation functions, and I was using only MSE with that. Now I want to add cross entropy and I am not able to write its backward.

The answer in simple terms is that, entry by entry, the derivative with respect to y_pred is softmax(y_pred)/batch_size for every non-target class, and (softmax(y_pred) - 1)/batch_size for the target class.
You can try it in PyTorch.
The equation basically translates to: (F.softmax(a, dim=1)/a.shape[0]) * (1 - F.one_hot(y_t)) + (F.softmax(a, dim=1) - 1) * F.one_hot(y_t)/a.shape[0]
where a is the input to the CE loss and y_t represents the target indices.
Or more simply: (F.softmax(a, dim=1) - F.one_hot(y_t))/a.shape[0]
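One way to convince yourself of this closed form is to check it against autograd; a short sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(4, 3, requires_grad=True)  # input to the CE loss
y_t = torch.tensor([0, 2, 1, 2])           # target indices

loss = F.cross_entropy(a, y_t)  # mean reduction by default
loss.backward()

# closed form: (softmax(a) - one_hot(y_t)) / batch_size
manual = (F.softmax(a.detach(), dim=1) - F.one_hot(y_t, num_classes=3)) / a.shape[0]
assert torch.allclose(a.grad, manual, atol=1e-6)
```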

The backward of cross entropy is as simple as probs - targets (scaled for the reduction, i.e. mean, sum, or weighted mean), where probs is the output of the softmax layer and targets is the one-hot encoded labels. So basically
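A minimal sketch of a hand-rolled softmax + cross-entropy layer along these lines (mean reduction assumed; the class name and attributes are made up for illustration):

```python
import torch

class SoftmaxCrossEntropy:
    """Fused softmax + cross entropy with a manual backward (hypothetical helper)."""

    def forward(self, scores, target):
        # scores: (N, C) raw inputs; target: (N,) class indices
        self.probs = torch.softmax(scores, dim=1)
        self.one_hot = torch.zeros_like(scores)
        self.one_hot[range(target.shape[0]), target] = 1.0
        log_probs = scores - scores.exp().sum(1, keepdim=True).log()
        return -(self.one_hot * log_probs).sum(dim=1).mean()

    def backward(self):
        # gradient w.r.t. scores: (softmax(scores) - one_hot) / batch_size
        return (self.probs - self.one_hot) / self.probs.shape[0]
```

The manual backward can be checked against autograd by calling .backward() on the returned loss and comparing scores.grad with the layer's own gradient.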