# Backward of CrossEntropyLoss

Hi, this is my cross-entropy loss:

``````
# convert class IDs to a one-hot encoding
one_hot = torch.zeros_like(y_pred)
one_hot[range(y_target.shape[0]), y_target] = 1.0

# y_pred minus its row-wise log-sum-exp is the log-softmax of the logits
ce = -one_hot*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()
``````

Can someone help me write the derivative of this with respect to `y_pred` in Python code? I just can't figure it out… What is the backward of this loss function?

Thank you for help.

PyTorch will automatically calculate the derivative for you when you do `.backward()`. But I guess you’re looking for something else?
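
For reference, here is a minimal sketch of what that looks like, assuming the logits are a leaf tensor created with `requires_grad=True`. The shapes and the seed are made up for illustration. The gradient that autograd stores in `y_pred.grad` is the value any manually derived backward has to reproduce.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 3 samples, 10 classes (illustrative shapes only).
y_pred = torch.randn(3, 10, requires_grad=True)   # raw logits
y_target = torch.randint(0, 10, (3,))             # integer class IDs

loss = F.cross_entropy(y_pred, y_target)          # mean reduction by default
loss.backward()                                   # fills y_pred.grad

# Keep a copy as the reference for any hand-written backward.
reference_grad = y_pred.grad.clone()
print(reference_grad.shape)
```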

Hi, I know `.backward()` can do it, but I am looking for a manually derived function.

I see. It’s probably better to write the equations here in mathematical terms and show us where you get stuck while applying the chain rule, and we can take it from there.
If you’re having difficulty with the CE derivative, you’ll easily find blog posts that walk through it.

Hi, thank you. Let's think about MSE loss.
This is the forward pass:

``````
((labels - prediction) ** 2).mean()
``````

And this is the backward pass:

``````
# .mean() averages over every element, so N is the total element count
N = labels.numel()
first_grad = -2*(labels - prediction) / N
``````

`first_grad` is then backpropagated through the preceding layers to get the gradients of the weights and biases.
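
A quick way to check a hand-written MSE gradient like this is to compare it with autograd. The sketch below assumes `N` is the total number of elements, because `.mean()` averages over all of them. The tensor shapes are arbitrary.

```python
import torch

torch.manual_seed(0)

# Arbitrary illustrative shapes: 4 samples, 3 outputs each.
prediction = torch.randn(4, 3, requires_grad=True)
labels = torch.randn(4, 3)

# Forward pass and autograd reference gradient.
loss = ((labels - prediction) ** 2).mean()
loss.backward()

# Manual gradient of the mean-reduced squared error w.r.t. prediction.
N = labels.numel()
manual_grad = -2 * (labels - prediction.detach()) / N

mse_match = torch.allclose(manual_grad, prediction.grad, atol=1e-6)
print(mse_match)
```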

The question is how to do the same with cross-entropy loss.
This is its forward pass:

``````
ce = -labels*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()
``````

How do I write the backward pass for this?

I’m not sure if your implementation of CE loss is correct. You need to multiply `y_pred` with the one-hot encoding before taking the sum. Check the definition (eq. 1) of CE in "Cross entropy" on Wikipedia or in "Loss Functions" in the ML Glossary documentation.

You can also call PyTorch’s CE loss function and ensure that you’re getting the same result via your code and PyTorch’s implementation.
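
A minimal sketch of such a check, assuming the forward pass from the question wrapped in a function (the name `custom_cross_entropy` and the shapes are made up here):

```python
import torch
import torch.nn.functional as F

def custom_cross_entropy(y_pred, y_target):
    # One-hot encode the integer class IDs.
    one_hot = torch.zeros_like(y_pred)
    one_hot[torch.arange(y_target.shape[0]), y_target] = 1.0
    # Log-softmax written out: logits minus the row-wise log-sum-exp.
    log_probs = y_pred - y_pred.exp().sum(1).log().unsqueeze(1)
    # Pick the target log-probability per row, then average over the batch.
    return -(one_hot * log_probs).sum(dim=1).mean()

torch.manual_seed(0)
logits = torch.randn(3, 10)
targets = torch.randint(0, 10, (3,))

custom_loss = custom_cross_entropy(logits, targets)
reference_loss = F.cross_entropy(logits, targets)

forward_match = torch.allclose(custom_loss, reference_loss, atol=1e-6)
print(custom_loss.item(), reference_loss.item())
```

Note that `exp()` on raw logits can overflow for large values, which `F.cross_entropy` avoids internally, so the two may diverge on extreme inputs.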

The question is still whether you’re struggling with the math or the implementation.
Also note that CE loss is usually used in conjunction with a softmax layer because the derivative of softmax+CE is really nice and easy to implement.

I am struggling with the implementation; I cannot write the backward pass. This forward matches PyTorch's cross entropy 1:1, so the forward is fine.

Here is proof, results are same as torch crossentropy:

``````
import torch

def CustomCrossEntropyB(y_pred, y_target):
    # convert class IDs to a one-hot encoding
    one_hot = torch.zeros_like(y_pred)
    one_hot[range(y_target.shape[0]), y_target] = 1.0

    # compute cross entropy directly on the logits (log-softmax written out)
    ce = -one_hot*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))

    # reduction: mean over the batch
    result = ce.sum(dim=1).mean()
    return result

if __name__ == "__main__":
    batch_size      = 3
    classes_count   = 10
    target_class_id = torch.randint(0, classes_count, (batch_size, ))

    # initial logits value
    x_initial = torch.randn((batch_size, classes_count))

    # NOTE: the tensor/optimizer setup and the zero_grad() calls are missing
    # from the post as it appears here. Trainable copies of x_initial and
    # plain SGD with lr=0.1 are assumptions made so the script runs.

    # reference, computed using torch cross entropy loss
    xr = x_initial.clone().requires_grad_(True)
    optimizer_r = torch.optim.SGD([xr], lr=0.1)
    loss_func_r = torch.nn.CrossEntropyLoss()

    # custom, computed using custom cross entropy loss
    xc = x_initial.clone().requires_grad_(True)
    optimizer_c = torch.optim.SGD([xc], lr=0.1)
    loss_func_c = CustomCrossEntropyB

    # find (train) logits
    for i in range(100):
        optimizer_r.zero_grad()
        loss_r = loss_func_r(xr, target_class_id)
        loss_r.backward()
        optimizer_r.step()

        optimizer_c.zero_grad()
        loss_c = loss_func_c(xc, target_class_id)
        loss_c.backward()
        optimizer_c.step()

        print(loss_r, loss_c)

    print("\n\n")
    print("final logits values : ")
    print(xr)
    print(xc)
``````

So, I built my own backward passes for linear layers and activation functions, and so far I have only used MSE with them. Now I want to add cross entropy, and I am not able to write its backward pass.

The answer in simple terms is that the derivative is `softmax(y_pred)/batch_size` for the entries whose class index is not the target class, and `(softmax(y_pred) - 1)/batch_size` for the entry of the target class.
You can try it in PyTorch.
The equation basically translates to:
`(F.softmax(a, dim=1)/a.shape[0]) * (1 - F.one_hot(y_t, a.shape[1])) + (F.softmax(a, dim=1) - 1)*F.one_hot(y_t, a.shape[1])/a.shape[0]`
where `a` is the input to the CE loss (the logits) and `y_t` represents the target indices.
Or more simply:
`(F.softmax(a, dim=1) - F.one_hot(y_t, a.shape[1]))/a.shape[0]`
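
To try it, here is a small sketch that wraps the simpler form in a function and compares it with autograd. The function name `cross_entropy_backward` is made up for this example, and it assumes the default mean reduction with no class weights and no ignored indices.

```python
import torch
import torch.nn.functional as F

def cross_entropy_backward(logits, targets):
    """Gradient of mean-reduced cross entropy w.r.t. the logits."""
    batch_size, num_classes = logits.shape
    probs = F.softmax(logits, dim=1)
    # Pass num_classes explicitly so the one-hot width always matches the logits.
    one_hot = F.one_hot(targets, num_classes=num_classes).to(logits.dtype)
    return (probs - one_hot) / batch_size

torch.manual_seed(0)
a = torch.randn(3, 10, requires_grad=True)   # logits
y_t = torch.randint(0, 10, (3,))             # target indices

# Autograd reference.
F.cross_entropy(a, y_t).backward()

manual_grad = cross_entropy_backward(a.detach(), y_t)
ce_grad_match = torch.allclose(manual_grad, a.grad, atol=1e-6)
print(ce_grad_match)
```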

The backward of cross entropy is as simple as `probabilities - labels`, scaled for the reduction (mean, sum, or weighted mean), where the probabilities are the output of the softmax layer and the labels are the one-hot encoded targets. So basically, with `N` the batch size for the mean reduction:

``````
first_grad = (softmax(prediction) - labels) / N
``````
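
Since the goal is to plug this into a hand-written backward for linear layers, here is a rough sketch of how `first_grad` would flow into one linear layer, checked against autograd. The layer sizes and the names `W`, `b`, `grad_W`, `grad_b`, and `grad_x` are assumptions for illustration, not taken from the original code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 4 samples, 5 input features, 3 classes.
N, in_features, classes = 4, 5, 3
x = torch.randn(N, in_features)
W = torch.randn(in_features, classes, requires_grad=True)
b = torch.zeros(classes, requires_grad=True)
targets = torch.randint(0, classes, (N,))

# Forward: linear layer producing logits, then mean-reduced cross entropy.
prediction = x @ W + b
loss = F.cross_entropy(prediction, targets)
loss.backward()  # autograd reference for W.grad and b.grad

# Manual backward, starting from the gradient w.r.t. the logits.
labels = F.one_hot(targets, num_classes=classes).to(x.dtype)
first_grad = (F.softmax(prediction.detach(), dim=1) - labels) / N

grad_W = x.t() @ first_grad           # gradient of the weights
grad_b = first_grad.sum(dim=0)        # gradient of the bias
grad_x = first_grad @ W.detach().t()  # gradient passed on to the previous layer

print(torch.allclose(grad_W, W.grad, atol=1e-6),
      torch.allclose(grad_b, b.grad, atol=1e-6))
```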

Also, I tried to write up the derivation of why this is the answer: Deriving categorical cross entropy and softmax | Shivam Mehta.
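
In short, the standard argument goes roughly as follows. The notation is my own and not necessarily that of the linked post: $z_{nk}$ are the logits, $p_{nk}$ the softmax probabilities, $y_{nk}$ the one-hot labels, $N$ the batch size, and $\delta_{km}$ is 1 when $k = m$ and 0 otherwise. The last step uses the fact that each one-hot row sums to 1.

```latex
\begin{aligned}
p_{nk} &= \frac{e^{z_{nk}}}{\sum_{j} e^{z_{nj}}}, \qquad
L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} y_{nk}\,\log p_{nk} \\
\log p_{nk} &= z_{nk} - \log\sum_{j} e^{z_{nj}}
\;\Rightarrow\;
\frac{\partial \log p_{nk}}{\partial z_{nm}} = \delta_{km} - p_{nm} \\
\frac{\partial L}{\partial z_{nm}}
&= -\frac{1}{N}\sum_{k} y_{nk}\,(\delta_{km} - p_{nm})
 = -\frac{1}{N}\Big(y_{nm} - p_{nm}\sum_{k} y_{nk}\Big)
 = \frac{p_{nm} - y_{nm}}{N}
\end{aligned}
```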