Dealing with NaN's in gradients

dpernes · July 28, 2017, 6:23pm

I am trying to train a Mixture Density Network, by reproducing the results of the toy example of section 5 of Bishop’s paper, where MDNs were initially proposed (link to the paper: https://publications.aston.ac.uk/373/1/NCRG_94_004.pdf).

I am using exactly the same network architecture proposed there (single hidden layer with 20 neurons and tanh activation). The only difference is that I am using SGD instead of BFGS as my optimization algorithm.

Unfortunately, after 2k or 3k iterations (during which the loss decreases considerably), I start getting NaNs as the loss value. After some intense debugging, I finally found where these NaNs first appear: they come from a 0/0 in the computation of the gradient of the loss w.r.t. the means of the Gaussian.

I am using negative log-likelihood as the loss function, L=-sum(log(p_i)). Therefore, the gradient of L w.r.t. p_i is

dLdp= -1/p_i

However, the derivative of a Gaussian function w.r.t. its mean, mu, is:

dpdmu = p_i * (x_i-mu_i)/(sigma_i^2)

so it is proportional to p_i. Therefore, when p_i underflows to 0, the derivative of the loss w.r.t. the mean is:

dLdmu = dLdp * dpdmu = (-1/p_i) * p_i * (x_i-mu_i)/(sigma_i^2) -> 0/0 -> NaN

However, this indeterminate form is easy to eliminate, since the expression simplifies algebraically to dLdmu = (mu_i - x_i)/sigma_i^2. Of course, PyTorch computes everything numerically, so it cannot perform this simplification. How can I deal with this issue? Can I at least replace NaNs with something else (zeros, for instance), so that they do not propagate?

(In all my reasoning I have assumed a MDN with one single Gaussian kernel, which is kind of stupid, but similar results roughly apply if we consider multiple kernels.)
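
A minimal sketch of this failure, assuming a single Gaussian with sigma = 1 and a target far enough from the mean that p_i underflows to zero in float32. The values x = 100 and mu = 0 are illustrative assumptions, not taken from the original experiment. The second half writes log(p_i) directly, which is one way to let autograd arrive at the simplified gradient (mu_i - x_i)/sigma_i^2.

```python
import math
import torch

# Illustrative values (assumptions): the target is 100 sigmas away from the mean.
x = torch.tensor([100.0])
mu = torch.tensor([0.0], requires_grad=True)
sigma = torch.tensor([1.0])

# Naive route: compute the density p first, then take its log.
# exp(-5000) underflows to 0 in float32, so backward computes (-1/0) * 0.
p = torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
loss_naive = -torch.log(p).sum()
loss_naive.backward()
naive_grad = mu.grad.clone()   # NaN

# Direct route: write log(p) in closed form, so p itself is never formed.
mu.grad = None
log_p = -0.5 * ((x - mu) / sigma) ** 2 - torch.log(sigma) - 0.5 * math.log(2 * math.pi)
loss_stable = -log_p.sum()
loss_stable.backward()
stable_grad = mu.grad.clone()  # (mu - x) / sigma^2

print(naive_grad, stable_grad)
```

With these values the simplified expression gives (0 - 100)/1 = -100, which is what the direct route should return.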

Thank you in advance.

tom · July 28, 2017, 9:01pm

Hello @dpernes

I think it is most common to write the expression for the log-likelihood in “one go”, so that you never compute dLdp explicitly.
The weights of your components sum to one, so you can compute the per-component log-likelihoods and then apply the usual log-sum-exp stabilisation to get safely from the per-component log-likelihoods to the full one.

Best regards

Thomas

dpernes · July 29, 2017, 2:28am

Hi @tom. Thank you for your reply!

Do you mean that I should write the expression for the log of p directly, instead of computing p and then applying log() to p? That seems a good idea, I will check if that works.

EDIT: After thinking a bit, that does not seem to be useful when I have more than one Gaussian component. Suppose that my Gaussian mixture has two components. Then, p is given by:

p = a0*exp(b0*(x-u0)^2) + a1*exp(b1*(x-u1)^2)

where a0, b0, a1 and b1 are some constants and u0 and u1 are the means. If we apply log() to p, there is no obvious way to simplify the expression, since we get the logarithm of a sum…

tom · July 29, 2017, 9:10pm

Hello,

That is the log-sum-exp computation I mentioned.
I have implemented something similar to what you describe in the forward method of GaussianMixture1d in this 1d Mixture Density Network notebook.
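
For readers without the notebook at hand, here is a minimal sketch of such a loss. It is not the notebook's code: the function name `mixture_nll` and the choice of having the network output mixing logits and log-standard-deviations are assumptions made for illustration. The point is that each component's log-density is written in closed form and the components are only combined through `torch.logsumexp`.

```python
import math
import torch

def mixture_nll(logits, mu, log_sigma, x):
    """Negative log-likelihood of 1-D targets under a Gaussian mixture.

    logits, mu, log_sigma: tensors of shape (batch, K), e.g. network outputs.
    x: tensor of shape (batch,).
    (Hypothetical helper, written for this example.)
    """
    # Log mixing weights; log_softmax makes the weights sum to one.
    log_pi = torch.log_softmax(logits, dim=1)
    # Per-component Gaussian log-densities in closed form.
    z = (x.unsqueeze(1) - mu) * torch.exp(-log_sigma)
    log_comp = -0.5 * z ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    # log p(x) = log sum_k exp(log_pi_k + log N(x | mu_k, sigma_k))
    log_p = torch.logsumexp(log_pi + log_comp, dim=1)
    return -log_p.sum()

# Target far from both means: each component density underflows in float32.
logits = torch.zeros(1, 2, requires_grad=True)
mu = torch.tensor([[0.0, 1.0]], requires_grad=True)
log_sigma = torch.zeros(1, 2, requires_grad=True)
x = torch.tensor([100.0])

loss = mixture_nll(logits, mu, log_sigma, x)
loss.backward()
print(loss.item(), mu.grad)
```

Predicting log_sigma rather than sigma is a design choice of this sketch; it keeps sigma positive without an extra constraint.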

Best regards

Thomas

dpernes · July 31, 2017, 10:15am

Got it, thank you!!

JethroJC · October 16, 2018, 1:44am

Hi. I ran into the same problem as you. Could you please show me your modified code for the loss computation? Thanks!

dpernes · October 17, 2018, 1:22pm

Hi Jethro,

Unfortunately, I can’t find the source code (this was for a toy example that I might have deleted).
However, scipy has an (open source) implementation of the log-sum-exp operation (see below), which is easy to adapt for PyTorch.

Moreover, the idea behind this function is very simple, so if you understand it you might not even need to look at the scipy code. Let me explain.

Log-sum-exp, as its name says, computes the logarithm of a sum of exponentials:

b = log(sum_i exp(a_i))

If the values inside the exponentials are all large negative numbers, the sum inside the log vanishes and we get the logarithm of 0, so we’re in trouble. To avoid this, the following simple trick is used:

Find the maximum value among the a_i’s: a_max = max_i a_i
Compute b using: b = a_max + log(sum_i exp(a_i - a_max))

Now, at least one of the values inside the exponentials will be 0, and exp(0) = 1, so we will certainly not get log(0). Thus, we now have a numerically robust implementation of the log-sum-exp operation.
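
The trick can be sketched in a few lines of plain Python. This is a simplified illustration, not the scipy implementation: it assumes a non-empty list of finite floats and ignores the axis, scaling, and sign options that scipy supports.

```python
import math

def logsumexp(a):
    """Return log(sum(exp(a_i))) for a non-empty list of finite floats."""
    a_max = max(a)
    # After the shift, the largest exponent is exactly 0, so the sum is >= 1.
    return a_max + math.log(sum(math.exp(a_i - a_max) for a_i in a))

values = [-1000.0, -1001.0, -1002.0]

# Naive evaluation: every exp() underflows to 0.0, so log() fails.
try:
    naive = math.log(sum(math.exp(v) for v in values))
except ValueError:
    naive = None  # math.log(0.0) raises "math domain error"

stable = logsumexp(values)
print(naive, stable)
```

Here the stable result is -1000 + log(1 + e^-1 + e^-2), roughly -999.59, while the naive evaluation cannot produce a number at all.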

Hope this explanation helps

https://github.com/scipy/scipy/blob/v0.19.1/scipy/special/_logsumexp.py#L8-L127

          def logsumexp(a, axis=None, b=None, keepdims=False, return_sign=False):
              """Compute the log of the sum of exponentials of input elements.
          
              Parameters
              ----------
              a : array_like
                  Input array.
              axis : None or int or tuple of ints, optional
                  Axis or axes over which the sum is taken. By default `axis` is None,
                  and all elements are summed.
          
                  .. versionadded:: 0.11.0
              keepdims : bool, optional
                  If this is set to True, the axes which are reduced are left in the
                  result as dimensions with size one. With this option, the result
                  will broadcast correctly against the original array.
          
                  .. versionadded:: 0.15.0
              b : array-like, optional
                  Scaling factor for exp(`a`) must be of the same shape as `a` or

(This file has been truncated.)

JethroJC · October 22, 2018, 3:06am

Thank you very much, your answer has given me a lot of help.

dpernes · October 30, 2018, 12:02pm

UPDATE: Since at least version 0.4.1, PyTorch has a built-in log-sum-exp function. See the link below.

https://pytorch.org/docs/stable/torch.html#torch.logsumexp

whoab · May 3, 2021, 6:55am

I am still getting NaNs for some reason while trying to train a mixture of Gaussians.

whoab · May 3, 2021, 7:04am

As with @dpernes, the issue is with the sum of exponentials, in my case w.r.t. the categorical variable. Applying his trick works. Also, see “get_mixture_coef” here: Mixture Density Networks with TensorFlow | 大トロ

whoab · May 3, 2021, 7:17am

Still not fixed, actually. Check this repo; the categorical logits are turning into NaNs:

Okay, I fixed it by applying a tanh before the final linear layer. No more NaNs: the tanh limits the magnitude of what goes into the final layer…
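
One possible reading of this fix, as a sketch only: the module name `MDNHead`, the layer sizes, and the split of the output into logits, means, and log-standard-deviations are all assumptions, since the repo's code is not shown here. Because tanh keeps the hidden activations in [-1, 1], each output of the final linear layer is bounded by the L1 norm of its weight row plus the absolute value of its bias, no matter how large the input is.

```python
import torch
from torch import nn

class MDNHead(nn.Module):
    """Hypothetical MDN head: linear -> tanh -> final linear."""

    def __init__(self, in_features, hidden, n_components):
        super().__init__()
        self.hidden = nn.Linear(in_features, hidden)
        self.out = nn.Linear(hidden, 3 * n_components)

    def forward(self, x):
        # tanh bounds the activations fed to the final layer to [-1, 1].
        h = torch.tanh(self.hidden(x))
        # Split the outputs into mixing logits, means and log-sigmas.
        logits, mu, log_sigma = self.out(h).chunk(3, dim=-1)
        return logits, mu, log_sigma

torch.manual_seed(0)
head = MDNHead(in_features=1, hidden=20, n_components=5)

# Even extreme inputs cannot push the outputs past the weight-dependent bound.
x = torch.tensor([[1e6], [-1e6], [0.0]])
logits, mu, log_sigma = head(x)
bound = head.out.weight.abs().sum(dim=1) + head.out.bias.abs()
print(logits.abs().max().item(), bound.max().item())
```

The bound grows with the final layer's weights, so this limits the logits at a given point in training rather than guaranteeing they stay small throughout.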