NaN grads after a few training steps

So, I am trying to design a model for unsupervised object detection. For now, its task is to locate the digits in a multi-digit MNIST dataset and draw a bounding box around each of them. I am using a variational-autoencoder-based architecture, a modification of this paper.

Now I am facing a lot of problems in the implementation. When I run the code, the weights become NaN after a few batches. So I checked the gradients of all the parameters and found that after a few steps the KL-divergence of the z_pres variable becomes NaN; moreover, the standard deviation of the gradient of the bias of the glimpse decoder and the z_pres encoder becomes NaN just after the first training batch. My questions are:

  1. Under what conditions can this happen? Does it have anything to do with the network architecture? Can a badly designed network lead to NaN gradients?

  2. What do affine_grid and grid_sample do, exactly? I read the documentation but could not quite understand their purpose. I used them to take a glimpse from the image based on z_where. Is their purpose similar to cropping a picture? Could someone please explain the arguments of F.affine_grid: what do the translation entries actually expect? Can this ever cause NaN?

  3. Please take a look at my code; this single error has cost me a lot of time. I would be grateful if someone could figure out what is wrong.
    i) spair.py
    ii) train_mnist.py

Thanks in advance :slight_smile:

Have you given exploding gradients a thought?
I would think that is the most likely cause of your NaN gradients.
You can try reducing the learning rate and check whether that helps.
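Besides lowering the learning rate, gradient clipping is a common way to keep exploding gradients in check. A minimal sketch (the model, data, and clip threshold here are placeholders, not the original SPAIR code):

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to show where clipping goes in the loop.
model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # try a smaller lr as well

x = torch.randn(32, 10)
y = torch.randn(32, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients in place so their global norm is at most max_norm.
# Returns the (pre-clipping) total norm, which is handy to log: a norm that
# blows up over training steps is a strong sign of exploding gradients.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Logging `total_norm` every step is often enough to see whether the gradients grow without bound before the NaNs appear.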

  1. Division by zero or values reaching infinity somewhere (not limited to these cases) can cause NaN.
  2. I don’t think these two can cause NaN by themselves. These two functions are generally used in Spatial Transformer Networks: affine_grid builds a sampling grid from an affine transformation of the input image, such as rotation, translation, and scaling (I think shearing is also part of the transformation, but I’m not sure), and grid_sample uses the coordinates generated by affine_grid to perform bilinear interpolation on the original image to form the output image. See https://pytorch.org/tutorials/intermediate/spatial_transformer_tutorial.html
  3. Your code is quite long :joy: so I only took a glance. My suggestion is to check the parts where you use .exp(); I have experienced NaN from exponential operations, though that may not be the case for you. If it is, the NaN in the gradients would also be explained. You can also use torch.isnan(INPUT).any() to see where the NaN first shows up.
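A small sketch of both suggestions: clamping a log-variance before calling .exp() so the resulting standard deviation cannot overflow, and probing a tensor for NaN/inf with torch.isnan / torch.isinf. The values below are made up for illustration; torch.autograd.set_detect_anomaly(True) is also worth enabling to find which op first produces the NaN in the backward pass.

```python
import torch

# A log-variance with one pathological entry; exp(100) overflows float32.
log_var = torch.tensor([0.5, 100.0, -3.0])

std_raw = log_var.exp()                   # contains inf
std_safe = log_var.clamp(max=10.0).exp()  # bounded before exp, stays finite

# Probe any intermediate tensor for NaN or inf:
bad = torch.isnan(std_raw) | torch.isinf(std_raw)
print(bad)  # shows which entries went wrong
```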

Hope this helps


Thank you so much for your reply. Can you explain what the argument of F.affine_grid represents? It is a batch of 2×3 matrices, where each 2×3 matrix contains affine transformation parameters. Can you explain which elements are for scaling and which are for translation?


I didn’t read the paper carefully so I might be wrong. The 2×3 matrix is a standard affine transform [[a, b, t_x], [c, d, t_y]]: the diagonal entries (a, d) are the x- and y-scaling, the last column (t_x, t_y) is the translation, and the off-diagonal entries (b, c) encode rotation/shearing. The translations are in normalized coordinates, so -1 to 1 spans the whole image. You should still take a look at the paper for the exact parameterization it uses.
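A runnable sketch of the attention-style crop, assuming a pure scale-and-shift theta (no rotation or shearing), which is the form typically used for glimpses: the diagonal entries scale, the last column translates, and everything is in normalized coordinates where -1..1 spans the input image.

```python
import torch
import torch.nn.functional as F

# 8x8 image with a bright square in the top-left quadrant.
img = torch.zeros(1, 1, 8, 8)
img[0, 0, 0:4, 0:4] = 1.0

# theta = [[s_x, 0, t_x], [0, s_y, t_y]].
# s = 0.5 keeps half the image in each dimension; t_x = t_y = -0.5 centres
# the glimpse on the top-left quadrant (coordinates run from -1 to 1).
theta = torch.tensor([[[0.5, 0.0, -0.5],
                       [0.0, 0.5, -0.5]]])

# affine_grid maps each output pixel to a sampling location in the input;
# grid_sample then bilinearly interpolates the input at those locations.
grid = F.affine_grid(theta, size=(1, 1, 4, 4), align_corners=False)
glimpse = F.grid_sample(img, grid, align_corners=False)
```

Since the glimpse lands exactly on the bright quadrant, it comes back as a 4×4 block of ones; changing t_x/t_y slides the crop window around the image, which is effectively what z_where drives in this kind of model.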