Difference between computing grad with torch.autograd.grad() and .backward + .grad

Hi all, I'm having difficulties building a differential machine learning model. It seems the problem is in the training part, more specifically in the loss.

The differential ML model has this loss:

loss = 0.5 * self.criterion(output, label) + 0.5 * self.criterion(gradients * self.lambda_j, dydx_batch * self.lambda_j)
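For context, here is a minimal, self-contained sketch of how such a loss could be wired into a training step. The toy model, the plain `criterion`/`lambda_j` names, and the way `dydx_batch` is obtained are assumptions for illustration only; the loss itself follows the expression above.

    import torch
    import torch.nn as nn

    # hypothetical setup for illustration only
    model = nn.Sequential(nn.Linear(10, 64), nn.Softplus(), nn.Linear(64, 1))
    criterion = nn.MSELoss()
    lambda_j = 1.0

    def diff_ml_loss(input, label, dydx_batch):
        input.requires_grad_(True)          # autograd must be able to reach the inputs
        output = model(input)

        # gradients of the outputs w.r.t. the inputs, kept in the graph
        # (create_graph=True) so they can be part of the loss
        gradients = torch.autograd.grad(outputs=output,
                                        inputs=input,
                                        grad_outputs=torch.ones_like(output),
                                        create_graph=True)[0]

        loss = (0.5 * criterion(output, label)
                + 0.5 * criterion(gradients * lambda_j, dydx_batch * lambda_j))
        return loss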

While the standard ML model has this one:

loss = self.criterion(output, label)

In the differential one, when I compute the gradients with grad_outputs filled with zeros:

    gradients = torch.autograd.grad(outputs=output,
                                    inputs=input,
                                    grad_outputs=output.data.new(output.shape).fill_(0),
                                    create_graph=True,
                                    retain_graph=True,
                                    allow_unused=True)[0]

I get the same result as the standard model, and the same result as when I calculate the gradients with:

output.sum().backward(retain_graph=True)
gradients = input.grad

On the other hand, when I fill grad_outputs with ones:

    gradients = torch.autograd.grad(outputs=output,
                                    inputs=input,
                                    grad_outputs=output.data.new(output.shape).fill_(1),
                                    create_graph=True,
                                    retain_graph=True,
                                    allow_unused=True)[0]

the result is different, but the prediction is a little bit worse…

So, what's the reason to use fill_(1)? I tested other values, for example 2 or -1, and the results were even worse.

And my other doubt: what's the difference between one way of calculating the gradients and the other?

This is expected, since passing a zero gradient as grad_outputs will produce an output gradient that contains only zeros, too.
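A quick standalone sketch (not the original model) shows this directly:

    import torch

    x = torch.randn(4, 3, requires_grad=True)
    y = (x ** 2).sum(dim=1, keepdim=True)

    # zero grad_outputs -> zero gradients, regardless of the actual dy/dx
    grad_zero = torch.autograd.grad(y, x,
                                    grad_outputs=torch.zeros_like(y),
                                    retain_graph=True)[0]
    print(grad_zero)                          # all zeros

    # ones grad_outputs -> the expected gradient 2*x
    grad_one = torch.autograd.grad(y, x,
                                   grad_outputs=torch.ones_like(y))[0]
    print(torch.allclose(grad_one, 2 * x))    # True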

The grad_outputs argument defines the incoming gradient, which is implicitly set to 1. in the backward call for scalar tensors.
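For example (standalone sketch), for a scalar output these three calls give the same gradient:

    import torch

    x = torch.randn(5, requires_grad=True)
    out = (3 * x).sum()                       # scalar output

    out.backward(retain_graph=True)           # implicit incoming gradient of 1.
    g1 = x.grad.clone()

    x.grad = None
    out.backward(gradient=torch.tensor(1.), retain_graph=True)   # explicit 1.
    g2 = x.grad.clone()

    g3 = torch.autograd.grad(out, x, grad_outputs=torch.ones_like(out))[0]

    print(torch.allclose(g1, g2), torch.allclose(g1, g3))        # True True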

The torch.autograd.grad call computes only the specified gradients instead of running the full backward pass and avoids unnecessary computations (e.g. it won't compute the weight gradients unless they are specified in inputs).
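A small illustration of that last point (standalone sketch with a toy model):

    import torch
    import torch.nn as nn

    model = nn.Linear(3, 1)
    x = torch.randn(4, 3, requires_grad=True)
    out = model(x).sum()

    # autograd.grad only computes gradients for the tensors listed in inputs;
    # the .grad attributes of the model parameters stay untouched (None here)
    dx = torch.autograd.grad(out, x, retain_graph=True)[0]
    print(model.weight.grad)                  # None

    # backward() runs the full backward pass and accumulates into .grad
    out.backward()
    print(model.weight.grad.shape)            # torch.Size([1, 3])
    print(torch.allclose(dx, x.grad))         # True: same dL/dx either way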