Autograd inconsistent / nan gradients

import torch

x = torch.tensor([0.0, 1.0], requires_grad=True)
y = torch.log(x)
grads, = torch.autograd.grad(y[1], x)
print(grads[0])

outputs nan, whereas

x = torch.tensor([-1.0, 1.0], requires_grad=True)
y = torch.log(x)
grads, = torch.autograd.grad(y[1], x)
print(grads[0])

outputs 0.0

Is this the expected output? If so, are there any quick hacks to make the gradients 0.0?

torch 1.9.1+cu111

Thank you!


Hi,

This is unfortunately the expected output, yes.
Since log is not defined on the negative side, the gradient it generates there can be anything; the choice is made to keep the op as fast as possible.

To make the gradients 0, you can simply make sure that you never pass an invalid input to log.
In this case, that means computing log(x[1]) directly; in general, check that entries satisfy x > 0 before passing them to log.
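One way to do that general masking (a sketch; safe_log here is just a hypothetical helper, not a torch API) is the "double where" trick, which substitutes a dummy in-domain value before calling log so that its backward never evaluates 1/x at an invalid point:

import torch

def safe_log(x):
    # Replace invalid entries by 1.0 *before* calling log, so the log
    # backward only ever evaluates 1/x at valid points; the outer where
    # then selects 0 for the invalid positions, whose gradient is exactly 0.
    safe_x = torch.where(x > 0, x, torch.ones_like(x))
    return torch.where(x > 0, torch.log(safe_x), torch.zeros_like(x))

x = torch.tensor([0.0, 1.0], requires_grad=True)
grads, = torch.autograd.grad(safe_log(x)[1], x)
print(grads)  # tensor([0., 1.]) -- no nan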

I am a bit confused by this rationale. The argument seems to be that when we provide something outside the domain of the function, the gradient will be some set value that is incorrect (and could be anything). However, we can see in the example above that the returned gradient in the x < 0 case is 0, which is in fact the correct gradient. I made a slightly different example to illustrate that it does not always return 0 and in fact seems to be returning the correct gradient:

import torch
from torch.autograd.functional import jvp

x2 = torch.tensor([-1.0, 1.0], requires_grad=True)

def log_1(x):
    return torch.log(x[1]) + 2 * x[0]

# directional derivative of log_1 along e_0 = [1., 0.]
print(jvp(log_1, x2, torch.eye(2)[0, :])[1])
# prints tensor(2.)

The issue here is that the first snippet in the OP's example returns nan whilst the second snippet returns the "correct" answer mathematically. Why is it not possible to make both return nan, or both return the correct gradients? Why would this be expected behaviour? It feels like there is some set logic differentiating these two, and the inconsistency really does not help with debugging. Is the difference that one produces nans and the other infs?

the x < 0 case is 0, which is in fact the correct gradient.

Not sure how you define “correct gradient” here? The function has no value there.

Why is it not possible to make both return nan, or both return the correct gradients?

We just use 1/x for the gradient of log. So when x is 0, it goes to inf/nan. But when it is negative, you just get other wrong values.
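You can reproduce that arithmetic by hand with plain tensors (a quick sketch of what the log backward does elementwise):

import torch

grad_out = torch.tensor([0.0, 1.0])  # gradient flowing back from y[1]

print(grad_out * (1.0 / torch.tensor([0.0, 1.0])))
# tensor([nan, 1.]) -- 1/0 = inf and 0 * inf = nan

print(grad_out * (1.0 / torch.tensor([-1.0, 1.0])))
# tensor([-0., 1.]) -- 0 times any finite value is a (signed) zero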

It's not well defined formally, I agree (you can't take a limit of the original function over values it is not defined at). However, the function does not depend on x[0], so one handwavy guess at its behaviour would be that it has derivative 0 when differentiated with respect to a variable it does not depend on, that is, d log(x)/dy |_{x=-1} = 0. Nonetheless, the correct thing here, I would think, is returning either an error or a value that clearly indicates something is wrong, such as nan.

We just use 1/x for the gradient of log. So when x is 0, it goes to inf/nan. But when it is negative, you just get other wrong values.

This is my current confusion: it's not like you are returning "random" wrong values, it is returning the gradient as though the function were defined for the provided out-of-range values (see the example above and the OP's), and this is problematic from the user's viewpoint, as one can think that this is the "correct gradient" when in fact the notion of gradient here is not well defined. What would be the issue with returning nan, as opposed to the current value, which is rather deceiving?

We just use 1/x for the gradient of log. So when x is 0, it goes to inf/nan.

I think it's possible you might have missed a detail in the OP's example. With y = [log(x1), log(x2)], the formal derivative of log(x2) with respect to x1 (which is what the OP computes) is not 1/x, it is 0. So I don't think this comment is particularly relevant. No disagreement about what the derivative of log is, of course; but without thinking about reverse-mode diff, you would expect the first snippet to return 0, not nan, as the derivative is not 1/x, it is 0.

Where it becomes confusing is that the gradient becomes 0 (gives the value you would expect) when you pass log a negative number; this feels like completely inconsistent / unexpected behaviour.

I understand the point being made: in order to be efficient, PyTorch decides to compute some derivative (which is wrong) in these particular scenarios so that compute is optimised, but I think our point is that these particular scenarios give quite deceitful answers.

Also note the OP is asking for a way to make the gradients of the first snippet 0, where x is not negative (it is 0); the response you have provided focuses on the second snippet, where the gradients are already 0, which is what the OP wants. So I think it's possible this point was missed in both responses, and it's still something we are a bit confused about. A user could argue that these wrong answers seem correct mathematically, since d log(x)/dx = 1/x, as a function, can be evaluated for negative x, and this is exactly what torch is doing when provided with negative values:

d log(x)/dx |_{x=-1, y=1} = -1 and d log(x)/dy |_{x=-1, y=1} = 0 (this is what PyTorch is outputting)

meanwhile in the first snippet (when x=0):

d log(x)/dy |_{x=0, y=1} = nan (when it should be 0?)

In short, I would expect the Jacobian of log([x, y]) at x=0, y=1 to be diagonal; however, that does not seem to be the case, whereas the Jacobian at x=-1, y=1 is in fact diagonal. I don't understand how this inconsistency can be expected behaviour.
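For concreteness, computing the full Jacobian with torch.autograd.functional.jacobian makes the asymmetry visible:

import torch
from torch.autograd.functional import jacobian

# At [0., 1.] the off-diagonal entry in the second row, which "should"
# be 0, comes out as 0 * (1/0) = 0 * inf = nan:
print(jacobian(torch.log, torch.tensor([0.0, 1.0])))
# tensor([[inf, 0.],
#         [nan, 1.]])

# At [-1., 1.] every off-diagonal product is 0 times a finite value, so
# the matrix is diagonal (even though the -1 entry is itself meaningless):
print(jacobian(torch.log, torch.tensor([-1.0, 1.0])))
# tensor([[-1., 0.],
#         [-0., 1.]]) -- signed zeros may display as -0.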

Why is this the case? As mentioned earlier, we expect this to be 0. I am not sure the question is being addressed; I think some details of the OP's example have been lost in the responses.

To clarify, by "correct gradient" I meant the function 1/x: the gradient of log(x) is only defined on R^+, but the formula 1/x extends to the whole real line (apart from 0). When we pass negative values to grad, we obtain -1/|x| as the gradient, and whilst formally that may be incorrect, it is an odd choice as it "feels" correct.

The specific example given by the OP considers d/dy (log(x)) |_{x=0, y=1} and d/dy (log(x)) |_{x=-1, y=1}. I think it would feel expected that the first is 0 (log(x) does not depend on y) and the second should be nan or a "wrong value", as you say. However, for the first we get nan, and for the second we get 0, which is the result we expect for the first and seems plausible (yet incorrect) for the second.

Thanks for writing this down in detail.

I think there is some confusion in your arguments about what the function being differentiated actually is.
In particular, the user function above includes the indexing part, and that is what makes all the x < 0 gradients 0: the gradient is then 0 * (1/x), which is going to be 0 for all nonzero x. And you can also see where the nan comes from when x = 0, since 0 * (1/0) = 0 * inf = nan.
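A quick sketch of that difference (essentially the OP's first snippet versus the log(x[1]) workaround suggested earlier):

import torch

x = torch.tensor([0.0, 1.0], requires_grad=True)

# log then index: log sees the invalid entry, so its backward computes
# grad_out * (1/x) = [0 * (1/0), 1 * (1/1)] = [0 * inf, 1] = [nan, 1]
g1, = torch.autograd.grad(torch.log(x)[1], x)
print(g1)  # tensor([nan, 1.])

# index then log: log only ever sees x[1]; the 0 at position 0 is written
# directly by the indexing backward, so no nan can appear
g2, = torch.autograd.grad(torch.log(x[1]), x)
print(g2)  # tensor([0., 1.])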

I would also add that the chain rule we use here is only properly defined when each constituent function is continuously differentiable at the point where you evaluate it.
So even if the full function has well-defined behavior, if you use a constituent at a point where it is not continuously differentiable, anything can happen (even though we try to return something sensible). In this case, you do evaluate log at x <= 0, so all the theoretical guarantees disappear :confused:
A simple example of this is to consider the identity function (gradient 1 everywhere) and write it as relu(x) - relu(-x). You will see that this gives you a gradient of 0 at 0, and there is nothing we can do about it.
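A minimal check of that example:

import torch

# Identity written as relu(x) - relu(-x): the true derivative is 1
# everywhere, but PyTorch picks relu'(0) = 0, so at x = 0 each term
# contributes 0 through the chain rule and the sum is 0 instead of 1.
x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x) - torch.relu(-x)
y.backward()
print(x.grad)  # tensor(0.)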

What would be the issue with returning nan?

Speed, mainly. Checking whether the input is positive or negative would be 2x the work of the current formula, and setting nans for the negative entries would end up being 3x slower in total (and would also use extra memory for the temporary mask/indices).
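If you want nans in the backward pass flagged loudly while debugging (it makes everything much slower, so debug-only), anomaly detection does that checking for you:

import torch

# detect_anomaly re-checks every backward result for nan and raises a
# RuntimeError pointing at the forward op that produced it.
with torch.autograd.detect_anomaly():
    x = torch.tensor([0.0, 1.0], requires_grad=True)
    y = torch.log(x)
    torch.autograd.grad(y[1], x)  # raises RuntimeError mentioning LogBackward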