# NaN gradient for torch.cos() / torch.acos()

Hello, I am trying to do the following forward calculation:
y_ij = ||x_i|| * cos(2 * theta_ij), where theta_ij = ∠(x_i, w_j)
is the angle between x_i (a row of matrix X) and w_j (a column of matrix W), and y_ij is an element of the resulting matrix Y.

There are two mathematically equivalent ways to implement it:

```python
xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x||, shape (N, 1)
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w||, shape (1, M)
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)
theta = cos_theta.acos()
cos_2_theta = torch.cos(2 * theta)
y = cos_2_theta * xlen  # xlen already has shape (N, 1)
```

Alternatively,

```python
xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x||, shape (N, 1)
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w||, shape (1, M)
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)
cos_2_theta = 2 * cos_theta ** 2 - 1  # double-angle identity: cos(2θ) = 2cos²(θ) - 1
y = cos_2_theta * xlen
```
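As a quick sanity check (a sketch with small random tensors, not the actual training data), the two snippets do agree in the forward pass:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)   # 4 samples, 3 features
w = torch.randn(3, 5)   # 5 weight vectors as columns

xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x||
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w||
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)

y1 = torch.cos(2 * cos_theta.acos()) * xlen  # first version, via acos
y2 = (2 * cos_theta ** 2 - 1) * xlen         # second version, double-angle identity

print(torch.allclose(y1, y2, atol=1e-6))  # → True
```

So the difference only shows up in the backward pass.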

However, the first one is numerically unstable: the gradients turn to NaN after several iterations, while the second one trains fine. Can anyone explain this?

Thanks!


When you do backpropagation with the first version, at some point you'll run into the derivative of `acos(x)`, which is `-1 / sqrt(1 - x^2)`. That blows up as `x` approaches 1 or -1, and that's what produces your NaNs. Note that your `clamp(-1, 1)` makes it quite likely that `cos_theta` lands on exactly 1 or -1.
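You can see the blow-up directly (a minimal sketch, using made-up scalar inputs):

```python
import torch

# d/dx acos(x) = -1 / sqrt(1 - x^2), which diverges as |x| -> 1
x = torch.tensor([0.0, 0.999999, 1.0], requires_grad=True)
x.acos().sum().backward()
print(x.grad)  # -1 at 0, huge (~ -707) at 0.999999, -inf at exactly 1.0
```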

In particular, consider the two functions `f(x) = cos(acos(x))` and `g(x) = x`. Their values agree on `[-1, 1]`, but their gradients behave very differently at `x = 1` and `x = -1`. When you backprop through `g(x)`, life is easy: for some operation `z` on the output `y = g(x)`, the chain rule gives you `dz/dx = dz/dy * dy/dx = dz/dy`.

On the other hand, with `y = f(x)`, the backpropagation looks like:
`dz/dx = dz/dy * dy/dx = dz/dy * (-sin(acos(x))) * (-1 / sqrt(1 - x^2))`
Mathematically the last two factors cancel to 1, but autograd evaluates them separately: as `x` approaches 1 or -1, the first factor goes to 0 while `1 / sqrt(1 - x^2)` overflows to `inf`, and `0 * inf` yields NaN.
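A minimal demonstration of the difference (a sketch with a single made-up input at the boundary):

```python
import torch

# g(x) = x: the gradient is exactly 1 everywhere, including at x = 1
xg = torch.tensor([1.0], requires_grad=True)
xg.sum().backward()
print(xg.grad)  # tensor([1.])

# f(x) = cos(acos(x)): same values on [-1, 1], but at x = 1 autograd
# multiplies -sin(acos(1)) = 0 by -1/sqrt(1 - 1) = -inf, and 0 * inf = nan
xf = torch.tensor([1.0], requires_grad=True)
torch.cos(torch.acos(xf)).sum().backward()
print(xf.grad)  # tensor([nan])
```

This is exactly why the double-angle version, which never calls `acos` in the backward graph, stays stable.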
