Nan grads after few training steps

So, I am trying to design a model for unsupervised object detection. As of now, its task is to locate the digits in a multi-MNIST dataset and draw bounding boxes around them. I am using a variational-autoencoder-based architecture, and it is a modification of this paper.

Now I am facing a lot of problems in the implementation. When I run the code, the weights become NaN after a few batches. I therefore checked the gradients of all the parameters and found that after a few steps the KL-divergence of the z_pres variable becomes NaN; moreover, the standard deviations of the gradients of the biases of the glimpse decoder and the z_pres encoder become NaN right after the first training batch. My questions are:

  1. Under what conditions can this happen? Does it have anything to do with the network architecture? Can a badly designed network lead to NaN gradients?

  2. What do affine_grid and grid_sample actually do? I read the documentation but could not quite understand their purpose. I used them to take a glimpse from the image based on z_where. Is their purpose similar to cropping a picture? Could someone please explain the arguments of F.affine_grid, in particular what the translation entries expect? Can these functions ever cause NaN?

  3. Please take a look at my code; this single error has cost me a lot of time. I will be grateful if someone can figure out what is wrong.

Thanks in advance :slight_smile:

Have you considered exploding gradients?
I would think that is the most likely cause of your NaN gradients.
You can try reducing the learning rate and check whether that helps.
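Besides lowering the learning rate, gradient clipping is a common safeguard against exploding gradients. A minimal sketch (the linear model, optimizer, and MSE loss are placeholders standing in for the actual VAE training step):

```python
import torch

# Toy model and data; stand-ins for the actual model and batch.
model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 10)
target = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), target)
opt.zero_grad()
loss.backward()

# Clip the global gradient norm before stepping; this caps the size of
# any single update even when the raw gradients explode.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
opt.step()
```

Clipping does not remove the underlying cause, but it keeps one bad batch from blowing up the weights while you debug.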

  1. Division by zero or reaching infinity somewhere (not limited to these cases) can cause NaN.
  2. I don’t think these two can cause NaN. These two functions are generally used in Spatial Transformer Networks. affine_grid generates a sampling grid from an affine transformation of the input image, such as rotation, translation, and scaling (I think shearing is also part of the transformation, but I’m not sure). grid_sample then uses the coordinates generated by affine_grid to perform bilinear interpolation on the original image and produce the output image.
  3. Your code is quite long :joy: so I only took a glance. My suggestion is to check the parts where you use .exp(); I have run into NaN from exponential operations before, though that may not be the case for you. If it is, that would also explain the NaN in the gradients. You can also use torch.any(torch.isnan(INPUT)) to see where the NaN first shows up.
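To illustrate point 3: a small self-contained example of how an unbounded `.exp()` (e.g. on a predicted log-variance) overflows to inf and then poisons the gradients with NaN. The value 1000.0 is made up just to force the overflow:

```python
import torch

# A log-variance that is far too large: exp() overflows to inf in float32.
log_var = torch.tensor([1000.0], requires_grad=True)
std = log_var.exp()  # inf, not NaN yet

# inf multiplied by a zero-like term downstream turns into NaN.
loss = (std * 0.0).sum()
loss.backward()
# log_var.grad is now NaN: 0 (upstream grad) * inf (local grad) = NaN.

# A common fix: clamp the log-variance before exponentiating.
safe_std = log_var.clamp(max=10.0).exp()
```

`torch.autograd.set_detect_anomaly(True)` is also useful here: it raises an error at the backward op that first produced the NaN.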

Hope this helps


Thank you so much for your reply. Can you explain what the argument of F.affine_grid represents? It is a b×2×3 tensor, where each 2×3 matrix contains the affine transformation parameters. Can you explain which elements are for scaling and which are for translation?

I didn’t read the paper carefully so I might be wrong. For a 2×3 matrix [[a, b, t_x], [c, d, t_y]], the last column (t_x, t_y) gives the x- and y-translations, while the 2×2 block (a, b, c, d) combines scaling, rotation, and shearing. I’m not sure about the exact convention used here, though; you should take a look at the paper and the documentation.
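A toy check of the layout, assuming the AIR-style glimpse convention where only a scale s and translations (t_x, t_y) are used and the rotation/shear entries are zero. Translations are in normalized [-1, 1] image coordinates:

```python
import torch
import torch.nn.functional as F

img = torch.zeros(1, 1, 8, 8)
img[0, 0, 0, 0] = 1.0  # mark the top-left pixel

# theta = [[s, 0, t_x], [0, s, t_y]]: s scales the glimpse window,
# (t_x, t_y) place its centre in normalized [-1, 1] coordinates.
s, tx, ty = 0.5, -0.5, -0.5  # half-size window over the top-left quadrant
theta = torch.tensor([[[s, 0.0, tx], [0.0, s, ty]]])

# affine_grid builds the sampling coordinates; grid_sample reads the
# image at those coordinates with bilinear interpolation, i.e. a "crop".
grid = F.affine_grid(theta, size=(1, 1, 4, 4), align_corners=False)
glimpse = F.grid_sample(img, grid, align_corners=False)
# glimpse is the 4x4 top-left quadrant, with the marked pixel at (0, 0).
```

Note that affine_grid itself won’t create NaN from finite inputs, but if z_where ever contains NaN (e.g. from an exploded encoder), it propagates straight through the grid into the glimpse.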