As far as I know, loss.backward() only works if loss is a single scalar value. I am considering a loss function that is not implemented in PyTorch (equation 5 of https://arxiv.org/pdf/1905.09670.pdf), and I want to avoid numerical problems.

I know that one way to stabilize the softmax + log score loss is to compute the derivative directly (as Bishop’s book states) and then backpropagate that derivative, instead of defining the loss with elementary functions and then using automatic differentiation.

I would like to do something similar for this problem. However, if I manually compute the derivative of this loss w.r.t. the logits, I end up with a vector/matrix instead of a single scalar value, so I can no longer use .backward(). How should I proceed in this case, i.e., is there an automatic way to backpropagate the gradients of a vector w.r.t. all the inputs to the graph?

backward() accepts a gradient as its argument.
If you don’t pass anything to this method, a scalar 1 is automatically used.
If you already have the gradient w.r.t. a tensor, you could thus call backward(gradient) on that tensor, e.g. loss.backward(gradient).
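As a rough illustration of passing a gradient to backward(), here is a minimal sketch using the softmax + cross-entropy case mentioned above rather than the loss from the paper. The toy linear model, the tensor shapes, and the names weight, inputs, targets and manual_grad are all assumptions made up for this example. It backpropagates the hand-derived gradient (softmax(z) - y) / N from the logits and compares the result with plain autograd:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy setup: 4 samples, 5 features, 3 classes.
weight = torch.randn(5, 3, requires_grad=True)
inputs = torch.randn(4, 5)
targets = torch.tensor([0, 2, 1, 2])

logits = inputs @ weight

# Hand-derived gradient of the mean cross-entropy w.r.t. the logits:
# (softmax(z) - y) / N. No graph is needed to compute it.
with torch.no_grad():
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=3).to(probs.dtype)
    manual_grad = (probs - one_hot) / logits.shape[0]

# Backpropagate the hand-derived gradient from the logits to the parameters.
logits.backward(manual_grad)
grad_from_manual = weight.grad.clone()

# Reference: let autograd differentiate the built-in loss.
weight.grad = None
loss = F.cross_entropy(inputs @ weight, targets)
loss.backward()
grad_from_autograd = weight.grad.clone()

print(torch.allclose(grad_from_manual, grad_from_autograd, atol=1e-6))
```

Note that backward() is called on the tensor the gradient refers to (here the logits), and the gradient passed in must have the same shape as that tensor.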

I have reviewed the documentation and have one more question. Let's say that gradients holds my hand-crafted derivatives of the probability vector w.r.t. the logits, which I have to backpropagate. I guess the correct call would now be:

gradients.backward(torch.ones_like(gradients))

Is this equivalent to calling gradients.sum().backward()? I guess it is.
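The two calls should give the same result, since the derivative of sum() w.r.t. each element is 1, which is exactly the all-ones tensor passed in the first variant. A small sketch to check this, where x and out are arbitrary stand-ins and not the tensors from the question:

```python
import torch

torch.manual_seed(0)

x = torch.randn(3, 4, requires_grad=True)

# Any non-scalar function of x will do; this one is an arbitrary stand-in.
out = torch.tanh(x) * x

# Variant 1: pass an all-ones tensor as the upstream gradient.
# retain_graph=True keeps the graph alive for the second backward pass.
out.backward(torch.ones_like(out), retain_graph=True)
grad_ones = x.grad.clone()

# Variant 2: reduce to a scalar with sum() and call backward() without arguments.
x.grad = None
out.sum().backward()
grad_sum = x.grad.clone()

print(torch.allclose(grad_ones, grad_sum))
```

One caveat: both variants call backward() on a tensor that is part of the graph. If gradients was computed by hand as the derivative w.r.t. the logits, it would more likely be passed as the argument, as in logits.backward(gradients), rather than being the tensor backward() is called on.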