Backward of crossentropyloss

Hi, this is my cross-entropy loss:

#convert class ID to one hot encoding
one_hot = torch.zeros_like(y_pred)
one_hot[range(y_target.shape[0]), y_target] = 1.0

ce = -one_hot*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()

Can someone help me write the derivative of this with respect to y_pred in Python code? I just cannot figure it out… What is the backward of this loss function?

Thank you for your help.

PyTorch will automatically calculate the derivative for you when you do .backward(). But I guess you’re looking for something else?

Hi, I know backward can do it. But I am looking for a manually derived function.

I see. It’s probably better to write the equations here in mathematical terms, show us where you get stuck while applying chain rule and we can take it from there.
If you’re having difficulty with the CE derivative, you’ll easily find blog posts that walk through its derivative, like:


Hi, thank you. Let's think about MSE loss.
This is the forward:

((labels - prediction) ** 2).mean()

And this is the backward:

N = labels.numel()  # .mean() averages over every element; equals labels.shape[0] for 1-D labels
first_grad = -2*(labels - prediction) / N

first_grad is then backpropagated through the preceding layers to get the gradients of the weights and biases.
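A quick way to confirm a hand-written backward like this is to compare it with autograd. The sketch below is only an illustration: the tensor shapes and the name manual_grad are assumptions, not part of the original post. It divides by the total element count because .mean() averages over every element, which equals labels.shape[0] only for 1-D tensors.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes chosen for illustration: 4 samples, 3 outputs each.
prediction = torch.randn(4, 3, requires_grad=True)
labels = torch.randn(4, 3)

# Forward pass, same expression as in the post; autograd gives the reference.
loss = ((labels - prediction) ** 2).mean()
loss.backward()

# Manual backward: d/d(prediction) of mean((labels - prediction)^2).
# The divisor is the number of elements that .mean() averaged over.
manual_grad = -2 * (labels - prediction.detach()) / labels.numel()

print(torch.allclose(manual_grad, prediction.grad))
```

If the two gradients disagree, the divisor is the first thing worth checking.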

The question is how to do the same with cross-entropy loss.
This is its forward:

ce = -labels*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))
result = ce.sum(dim=1).mean()

How do I write the backward for this?

I’m not sure if your implementation of CE loss is correct. You need to multiply y_pred with the one-hot encoding before taking the sum. Check the definition (eq. 1) of CE here: Cross entropy - Wikipedia or here Loss Functions — ML Glossary documentation

You can also call PyTorch’s CE loss function and ensure that you’re getting the same result via your code and PyTorch’s implementation.

The question is still whether you’re struggling with the math or the implementation.
Also note that CE loss is usually used in conjunction with a softmax layer because the derivative of softmax+CE is really nice and easy to implement.
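For reference, here is a sketch of why the combined softmax+CE derivative comes out so simple. The notation is an assumption introduced for this derivation: z_{ik} stands for the logits (y_pred in the code above), y_{ik} for the one-hot labels, and N for the batch size. The loss is the one computed by the forward code in this thread, with mean reduction.

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{ik}\Big(z_{ik} - \log\sum_{j} e^{z_{ij}}\Big)

\frac{\partial L}{\partial z_{im}}
  = -\frac{1}{N}\sum_{k} y_{ik}\Big(\delta_{km} - \frac{e^{z_{im}}}{\sum_{j} e^{z_{ij}}}\Big)
  = -\frac{1}{N}\Big(y_{im} - \operatorname{softmax}(z_i)_m \sum_{k} y_{ik}\Big)
  = \frac{1}{N}\big(\operatorname{softmax}(z_i)_m - y_{im}\big)
```

The last step uses the fact that each one-hot row sums to 1.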

I am struggling with the implementation; I cannot write the backward. This forward matches PyTorch's cross entropy 1:1, so the forward is good.

Here is proof; the results are the same as torch's CrossEntropyLoss:

import torch
def CustomCrossEntropyB(y_pred, y_target):
    #convert class ID to one hot encoding
    one_hot = torch.zeros_like(y_pred)
    one_hot[range(y_target.shape[0]), y_target] = 1.0

    #compute cross entropy without softmax
    ce = -one_hot*(y_pred - y_pred.exp().sum(1).log().unsqueeze(1))

    #reduction mean
    result = ce.sum(dim=1).mean()
    return result

if __name__ == "__main__":
    batch_size      = 3
    classes_count   = 10
    target_class_id = torch.randint(0, classes_count, (batch_size, ))

    #initial logits value
    x_initial = torch.randn((batch_size, classes_count))

    #reference, computed using torch cross entropy loss
    xr = torch.nn.Parameter(x_initial.clone(), requires_grad=True)

    loss_func_r = torch.nn.CrossEntropyLoss()
    optimizer_r  = torch.optim.Adam([xr], lr=0.5)

    #custom, computed using custom cross entropy loss
    xc = torch.nn.Parameter(x_initial.clone(), requires_grad=True)

    loss_func_c = CustomCrossEntropyB
    optimizer_c = torch.optim.Adam([xc], lr=0.5)

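    #find (train) logits
    for i in range(100):
        #reference step
        optimizer_r.zero_grad()
        loss_r = loss_func_r(xr, target_class_id)
        loss_r.backward()
        optimizer_r.step()

        #custom step
        optimizer_c.zero_grad()
        loss_c = loss_func_c(xc, target_class_id)
        loss_c.backward()
        optimizer_c.step()

        print(loss_r, loss_c)

    print("final logits values : ")
    print(xr.detach(), xc.detach())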

So, I built my own backward for linear layers and activation functions, and so far I have only used MSE with it. Now I want to add cross entropy, and I am not able to write its backward.

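The answer in simple terms is that the derivative with respect to each logit is softmax(y_pred)/batch_size for the entries whose class index is not the target class, and (softmax(y_pred) - 1)/batch_size for the entry at the target class.
You can try it in PyTorch.
The equation basically translates to:
(F.softmax(a, dim=1)/a.shape[0]) * (1 - F.one_hot(y_t, a.shape[1])) + (F.softmax(a, dim=1) - 1)*F.one_hot(y_t, a.shape[1])/a.shape[0]
where a is the input to the CE loss (the logits) and y_t represents the target indices. Passing a.shape[1] as num_classes keeps the one-hot tensor the same shape as a even when the highest class index does not occur in the batch.
Or more simply:
(F.softmax(a, dim=1) - F.one_hot(y_t, a.shape[1]))/a.shape[0]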
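The simplified expression can be checked against autograd. This is a minimal sketch, assuming the default mean reduction of F.cross_entropy; the sizes and the names manual_grad and one_hot are illustrative choices, and num_classes is passed explicitly so the one-hot tensor always matches the shape of a.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, classes_count = 3, 10
a = torch.randn(batch_size, classes_count, requires_grad=True)  # logits
y_t = torch.randint(0, classes_count, (batch_size,))            # target class indices

# Reference gradient from autograd (reduction="mean" is the default).
loss = F.cross_entropy(a, y_t)
loss.backward()

# Manual backward: (softmax(a) - one_hot(y_t)) / batch_size
one_hot = F.one_hot(y_t, num_classes=classes_count).to(a.dtype)
manual_grad = (F.softmax(a.detach(), dim=1) - one_hot) / a.shape[0]

print(torch.allclose(manual_grad, a.grad, atol=1e-6))
```

For reduction="sum" the division by a.shape[0] would be dropped.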

The backward of cross entropy is as simple as probabilities - labels (scaled for the reduction, i.e. mean, sum, or weighted mean), where the probabilities are the output of the softmax layer and the labels are the one-hot encoded targets. So basically, with N as the batch size:

first_grad = (softmax(prediction) - labels) / N

Also, I tried to add the derivation of why this is the answer: Deriving categorical cross entropy and softmax | Shivam Mehta.
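To connect this with a hand-written backward for a linear layer, here is a rough sketch of how first_grad could be propagated one layer back. The layer layout (prediction = x @ W + b) and the names x, W, b, grad_W, grad_b, and grad_x are assumptions made for illustration; they are not taken from the original poster's code. Autograd is used only to check the result.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 4 samples, 5 input features, 3 classes.
N, in_features, classes_count = 4, 5, 3
x = torch.randn(N, in_features)
W = torch.randn(in_features, classes_count, requires_grad=True)
b = torch.zeros(classes_count, requires_grad=True)
target = torch.randint(0, classes_count, (N,))
labels = F.one_hot(target, num_classes=classes_count).float()

# Forward: linear layer producing logits, then cross entropy (mean reduction).
prediction = x @ W + b
loss = F.cross_entropy(prediction, target)
loss.backward()  # reference gradients from autograd

# Manual backward.
with torch.no_grad():
    first_grad = (F.softmax(prediction, dim=1) - labels) / N  # dL/d(prediction)
    grad_W = x.T @ first_grad                                 # dL/dW
    grad_b = first_grad.sum(dim=0)                            # dL/db
    grad_x = first_grad @ W.T                                 # passed on to earlier layers

print(torch.allclose(grad_W, W.grad, atol=1e-6),
      torch.allclose(grad_b, b.grad, atol=1e-6))
```

grad_x plays the same role for the previous layer that first_grad plays here.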