I am trying to design a model for unsupervised object detection. For now, its task is to locate the digits in a multi-MNIST dataset and draw a bounding box around each of them. I am using a variational-autoencoder-based architecture, a modification of this paper.
I am facing several implementation problems. When I run the code, the weights become NaN after a few batches. I therefore checked the gradients of all the parameters and found that, after a few steps, the KL divergence of the `z_pres` variable becomes NaN; moreover, the standard deviations of the gradients of the biases of the glimpse decoder and the `z_pres` encoder become NaN right after the first training batch. My questions are:
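For context, my current suspicion is that the Bernoulli KL for `z_pres` blows up when a predicted probability saturates at 0 or 1, since `log(0)` evaluates to `-inf` in floating point and `0 * -inf` is NaN. Here is a minimal pure-Python sketch of that failure mode (the clamping `eps` is my own workaround, not something from the paper):

```python
import math

# In PyTorch, log(0) evaluates to -inf rather than raising, so the
# p * log(p / q) term of the Bernoulli KL becomes 0 * -inf = nan
# as soon as a probability saturates.
print(0.0 * float("-inf"))  # nan

def bernoulli_kl(p, q, eps=1e-6):
    """KL(Bern(p) || Bern(q)), with both probabilities clamped to
    [eps, 1 - eps] so the log terms stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

print(bernoulli_kl(0.0, 0.999))  # finite, thanks to the clamping
```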
- Under what conditions can this happen? Does it have anything to do with the network architecture? Can a poorly designed network lead to NaN gradients?
- What exactly do `F.affine_grid` and `F.grid_sample` do? I read the documentation but could not understand their purpose. I use them to take a glimpse from the image based on `z_where`, so is their purpose similar to cropping a picture? Could someone explain the arguments of `F.affine_grid`: what do the translation entries actually expect? Can this ever cause NaN?
- Please take a look at my code; this single error has cost me a lot of time. I will be grateful if someone can figure out what is wrong in it.
i) spair.py
ii) train_mnist.py
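For reference, here is a NumPy sketch of my current understanding of what `affine_grid` computes (assuming the `align_corners=True` convention): for each output pixel, it maps that pixel's normalized coordinate in [-1, 1] through the 2x3 matrix `theta` to a normalized source coordinate in the input, and `grid_sample` then bilinearly reads the input at those coordinates. The scale/translation values below are made-up illustration numbers:

```python
import numpy as np

def affine_grid_2d(theta, H, W):
    """Mimic F.affine_grid for a single image: each output pixel's
    normalized (x, y) in [-1, 1] is mapped through theta to the input
    coordinate it should be sampled from."""
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    grid = np.zeros((H, W, 2))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            grid[i, j] = theta @ np.array([x, y, 1.0])  # (x_in, y_in)
    return grid

# theta = [[sx, 0, tx], [0, sy, ty]]: scale 0.5 with translation
# (0.5, -0.5) selects a half-size window centred at normalized input
# coordinates (0.5, -0.5) -- i.e. a crop, which grid_sample then resamples.
theta = np.array([[0.5, 0.0, 0.5],
                  [0.0, 0.5, -0.5]])
grid = affine_grid_2d(theta, 4, 4)
print(grid[0, 0])  # top-left output pixel reads input coords (0.0, -1.0)
```

If this mental model is right, the translation entries are in normalized image coordinates (so a `z_where` shift of 1.0 moves by half the image width), and NaNs in `z_where` would propagate straight through `theta` into the grid.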
Thanks in advance