Getting NaN in backward


I am implementing the ladder network and have come across a possible bug in the backward function. Whenever a variable that is part of the cost evaluates to 0, the backward method produces NaN.

In this particular case the denoising function is given by:

z_denoised = (z_level-mu_i)·v_i+mu_i

where z_level is the corrupted signal before the activation and the addition of BN params, and z_proj is the latent variable projected using BN(V·z^(l+1)).
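As a sanity check on the expression above, its partial derivatives can be written out by hand. The sketch below is plain Python and treats z_level, mu_i and v_i as independent scalars, which is a simplification; the names denoise and denoise_grads and the value 0.5 are only illustrative. Under that assumption, the expression on its own has finite gradients when mu_i and v_i are both zero.

```python
import math

def denoise(z_level, mu, v):
    """Denoising expression from the post: (z_level - mu) * v + mu."""
    return (z_level - mu) * v + mu

def denoise_grads(z_level, mu, v):
    """Analytic partial derivatives of denoise w.r.t. each argument."""
    return {
        "z_level": v,        # d/dz_level
        "mu": 1.0 - v,       # d/dmu
        "v": z_level - mu,   # d/dv
    }

# Evaluate at the initialization described in the post: mu = v = 0.
# z_level = 0.5 is an arbitrary illustrative value.
value = denoise(0.5, 0.0, 0.0)
grads = denoise_grads(0.5, 0.0, 0.0)
all_finite = all(math.isfinite(g) for g in grads.values())
print(value, grads, all_finite)
```

If this holds in the full network as well, a NaN in backward would have to originate in one of the operations that feed into or follow this expression rather than in the expression itself.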

I have checked the parameter initialization from the github AI ladder network code and use the same initialization. In this case all the denoising function params “a” are initialized to zero, except a2 and a7. This means that in the first batch epoch the z_denoised value is 0, because mu_i and v_i are zero. In this case I get NaN in the grad attribute of all the parameters of the ladder network (because all of them influence either z_proj or z_level). Whenever I change the initialization of “a” to 0.00001, everything works fine and I get a test error of about 0.76 (which means my implementation seems to be nearly correct).

Moreover, if I use as the denoising function something like:


I also get NaN. However, if I simply use:


Everything works fine. I think there is a problem in backward when evaluating a gradient at 0. Note that the SSE function computes:

MSE(x-x_recon), where x_recon is 0 in the first batch epoch but x is not, since it is the clean projection. This means there is an error to backpropagate. Moreover, the error in the first epoch is around 180, so it is not large enough for the update to cause NaN, and the NaNs are observed before optimizer.step().
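To narrow down where the NaNs first appear between backward and optimizer.step(), one option is to scan the grad attribute of every parameter right after backward. The sketch below assumes a recent PyTorch release where torch.isnan is available; params_with_nan_grad is a hypothetical helper name, and the nn.Linear model is only a toy stand-in for the ladder network.

```python
import torch
import torch.nn as nn

def params_with_nan_grad(model):
    """Return the names of parameters whose .grad contains at least one NaN."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and bool(torch.isnan(p.grad).any())
    ]

# Toy stand-in model and loss, only to exercise the helper.
model = nn.Linear(3, 2)
x = torch.randn(4, 3)
loss = (model(x) ** 2).mean()
loss.backward()

# Call this after backward() and before optimizer.step().
clean = params_with_nan_grad(model)

# Inject a NaN by hand to show what a positive report looks like.
model.weight.grad[0, 0] = float("nan")
flagged = params_with_nan_grad(model)
print(clean, flagged)
```

Newer PyTorch versions also provide torch.autograd.set_detect_anomaly(True), which raises an error at the backward function that produced the NaN; whether it is available depends on the installed version.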

Thanks. I can provide the code I am using if needed.

Could you please provide a minimal example that shows this? I’ve tried the following to test:

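import torch
from torch.autograd import Variable
x = Variable(torch.zeros(1), requires_grad=True)
out = x * 3
out.backward()  # populates x.grad; without this call x.grad is None
x.grad  # prints 3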

but it isn’t as complex of an example as what you have in mind.

Well, actually I am training the ladder network, which is quite complex, so I am not sure I can provide a simple example where it happens. Maybe it needs a huge graph, since with this minimal example everything works.

The only division by zero I have in my code is in batch normalization; however, I prevent this using epsilon, which is now fixed to 1. I print the cost after the first forward pass and it is around 190, and I do not see any variables at NaN. They appear only after calling backward.
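For reference, one general way to get finite forward values but NaN only after backward is an operation whose local derivative is infinite at 0, such as a square root, combined with a zero upstream gradient (0 times infinity). The sketch below illustrates that mechanism in isolation, assuming a recent PyTorch release; it is not a diagnosis of the code in this thread, where epsilon is reported to be 1, and the eps value used here is only illustrative.

```python
import torch

# Forward: sqrt(0) = 0 is finite, but d/dx sqrt(x) is infinite at x = 0.
x = torch.zeros(1, requires_grad=True)
y = torch.sqrt(x)
loss = (y * 0.0).sum()   # the gradient flowing back into sqrt is 0
loss.backward()
forward_is_finite = bool(torch.isfinite(y).all())
grad_is_nan = bool(torch.isnan(x.grad).all())

# Adding a small constant inside the square root keeps the derivative finite.
eps = 1e-5  # illustrative value, not taken from the thread
x2 = torch.zeros(1, requires_grad=True)
loss2 = (torch.sqrt(x2 + eps) * 0.0).sum()
loss2.backward()
grad2_is_finite = bool(torch.isfinite(x2.grad).all())

print(forward_is_finite, grad_is_nan, grad2_is_finite)
```

A hand-written batch normalization that takes the square root of a variance that is exactly zero, with the epsilon added outside the root or not at all, would be one place where this pattern could show up.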

When I change the init value of these variables to, let's say, 0.00000000001 instead of 0, my ladder network gets a test error under 0.70.

I tried to reproduce this in a simple example but could not get one. Maybe I could save the model parameters and send them.


One more point. Today I updated the parameter initialization, changing the bias initialization from 1 to 0. Now I can initialize the denoising params to 0 and everything works. It seems to be an internal problem in PyTorch, as the input to the differentiable variable is:


where, in the upper layer, proj comes from a batch normalization of the softmax activation of a linear projection + BN + noise + the added batch norm parameters. The bias is included as the beta of the batch norm, which means the linear projection only performs a matrix multiplication.

Then just use

z_denoised = (z_level-mu_i)*v_i+mu_i

where z_level is the linear projection +BN+ noise

This should not give NaN just because a bias is initialized to one when evaluating the MSE of z_denoised against z_clean, which goes through the same transformation as z_level except that no noise is added to it.


I changed some things in the code. The point was not initializing the bias to 0; it has to do with batch normalization. When I finish I will push the code to GitHub so the PyTorch developers can fix the problem.