I’m using a loss function that returns a tensor (a separate loss for each element). I understand that I have to set the gradient parameter in loss.backward(gradient=???). What I don’t understand is how to obtain said gradients. Following the answer to a previous question, I set up the training step as follows. However, this leads to no useful training (which makes sense, since the gradients are always 1?)

Normally we compute a loss function in order to train a network by
minimizing the loss function. (And we compute the gradient of the
loss so that we can minimize the loss with some version of gradient
descent.)

If your loss function results in a tensor, how do you propose to train
your network? Minimizing the loss for, say, element 1 will not, in
general, also minimize the loss for element 2.

What we generally do is minimize some weighted combination of
those per-element losses. But that weighted combination just becomes
our single scalar loss that we minimize by calling loss.backward(),
etc.
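
As a minimal sketch of that point (the model, data, and weights below are
made up for illustration and are not taken from the question), reducing the
per-element losses to a weighted sum and calling backward() with no argument
should give the same gradients as passing those weights as the gradient
argument:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model and data, only for illustration.
model = torch.nn.Linear(3, 1)
x = torch.randn(4, 3)
target = torch.randn(4, 1)

# Per-element loss: one value per sample, shape (4, 1).
per_element = torch.nn.functional.mse_loss(model(x), target, reduction="none")

# Assumed weights for the combination (here they amount to a plain mean).
weights = torch.full_like(per_element, 1.0 / per_element.numel())

# Option A: reduce to a scalar, then call backward() with no argument.
model.zero_grad()
(weights * per_element).sum().backward(retain_graph=True)
grad_scalar = model.weight.grad.clone()

# Option B: keep the tensor loss and pass the weights as `gradient`.
model.zero_grad()
per_element.backward(gradient=weights)
grad_vector = model.weight.grad.clone()

# Both routes compute the same vector-Jacobian product.
same = torch.allclose(grad_scalar, grad_vector, atol=1e-6)
print(same)
```

So the gradient argument does not have to be "obtained" from anywhere: it is
just the set of weights you choose for combining the per-element losses.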

If you actually need to compute the gradient for each separate per-element
loss, then you need to run multiple backward passes in a loop.

You can use PyTorch’s torch.autograd.functional.jacobian() function to run
this loop for you, or you can run it by hand.
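
Here is a rough sketch of both routes, assuming a toy per-element loss
(the names w, x, target, and per_element_loss are hypothetical). The
hand-written loop uses torch.autograd.grad() for each backward pass so that
nothing accumulates in .grad; calling loss[i].backward(retain_graph=True)
and reading w.grad would be the other way to write it.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# Hypothetical parameters and data, only for illustration.
w = torch.randn(3, requires_grad=True)
x = torch.randn(4, 3)
target = torch.randn(4)


def per_element_loss(params):
    # One squared error per sample, shape (4,).
    return (x @ params - target) ** 2


# By hand: one backward pass per loss element.
loss = per_element_loss(w)
rows = []
for i in range(loss.numel()):
    # retain_graph=True keeps the graph alive for the next iteration.
    (g,) = torch.autograd.grad(loss[i], w, retain_graph=True)
    rows.append(g)
grads_loop = torch.stack(rows)  # shape (4, 3): row i is d loss[i] / d w

# Let jacobian() run the loop: output shape is loss.shape + w.shape.
grads_jac = jacobian(per_element_loss, w)

print(torch.allclose(grads_loop, grads_jac, atol=1e-6))
```

Each row of the result is the gradient of one per-element loss with respect
to the parameters. Note that this costs one backward pass per element.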

Some comments about what is going on when you use gradient = torch.ones_like(loss) can be found in this post: