NaN gradient for torch.cos() / torch.acos()


(Paralysis) #1

Hello, I am trying to do the following forward calculation:
y_ij = ||x_i|| * cos(2*theta_ij)
where theta_ij is the angle between x_i and w_j, x_i is a row of matrix X, w_j is a column of matrix W, and y_ij is an element of the resulting matrix Y.

There are two equivalent ways to realize it:

xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x||
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w||
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)
theta = cos_theta.acos()
cos_2_theta = torch.cos(2*theta)
y = cos_2_theta * xlen  # xlen already has shape (-1, 1)

Alternatively,

xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x||
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w||
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)
cos_2_theta = 2 * cos_theta ** 2 - 1  # cos(2x) = 2cos(x)^2-1
y = cos_2_theta * xlen
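To confirm the two formulations really compute the same forward values, here is a quick self-contained sketch (random X and W with assumed shapes; it relies on the identity cos(2*acos(c)) = 2c^2 - 1 for c in [-1, 1]):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)   # rows are the x_i
w = torch.randn(3, 5)   # columns are the w_j

xlen = x.pow(2).sum(1).pow(0.5).view(-1, 1)  # ||x_i||, shape (4, 1)
wlen = w.pow(2).sum(0).pow(0.5).view(1, -1)  # ||w_j||, shape (1, 5)
cos_theta = (x.mm(w) / xlen / wlen).clamp(-1, 1)

# version 1: go through acos and back
y1 = torch.cos(2 * cos_theta.acos()) * xlen
# version 2: double-angle identity, no acos
y2 = (2 * cos_theta ** 2 - 1) * xlen

print(torch.allclose(y1, y2, atol=1e-5))
```

The forward results agree to floating-point precision; the difference between the two versions only shows up in the backward pass.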

However, the first one is very unstable: its gradients turn to NaN after several iterations, while the second one works fine. Can anyone explain this issue?

Thanks!


#2

When you do backpropagation with the first version, at some point you’ll run into the derivative of acos(x), which is -1 / sqrt(1 - x^2). That blows up and leads to your NaNs when x is close to 1 or -1.

In particular, consider the following two functions: f(x) = cos(acos(x)) and g(x) = x. They are mathematically identical on [-1, 1], but autograd computes their gradients very differently. Backpropagating through g(x) is easy: for some operation z on the output y = g(x), the chain rule gives you dz/dy * dy/dx = dz/dy, since dy/dx = 1.

On the other hand, with y = f(x), the backward pass computes:
dz/dy * dy/dx = dz/dy * (-sin(acos(x))) * (-1 / sqrt(1 - x^2))
Analytically the two factors cancel to 1, since sin(acos(x)) = sqrt(1 - x^2). Numerically, though, when x is close to 1 or -1 you are multiplying a value near 0 by a value near infinity, and at exactly x = 1 or -1 this becomes 0 * inf = NaN.
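The failure mode is easy to reproduce in isolation (a minimal sketch; analytically cos(acos(x)) = x, so the true gradient at x = 1 is 1):

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = torch.cos(torch.acos(x))  # forward is fine: acos(1) = 0, cos(0) = 1
y.backward()
# backward multiplies -sin(acos(x)) = -0 by -1/sqrt(1 - x^2) = -inf,
# and 0 * inf is NaN
print(x.grad)  # tensor([nan])
```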