I am training a simple CNN (2 conv layers) with a triplet margin loss. At some point during training, the weights of the first conv layer become NaN (I am still trying to figure out why). At that point, the output of the first conv layer is a tensor full of -Inf or +Inf values. Nevertheless, the output of the entire network is a valid tensor of real numbers. In fact, if I forward a tensor of Infs through a conv layer, the output seems to be a bunch of 1s. Is this the expected behavior?
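A minimal sketch for reproducing this check in isolation, assuming a stand-alone `nn.Conv2d` with default initialization rather than the actual model (the layer sizes and input shape here are hypothetical). It forwards an all-Inf tensor through the layer and reports whether any output element is finite:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the first conv layer; the real layer will differ.
conv = nn.Conv2d(1, 4, kernel_size=3)

# Input tensor in which every element is +Inf.
x = torch.full((1, 1, 8, 8), float("inf"))

with torch.no_grad():
    out = conv(x)

# Report what kind of values the layer produced.
print("any finite values:", torch.isfinite(out).any().item())
print("any NaN values:   ", torch.isnan(out).any().item())
```

Running the same check on the intermediate activations of the real model (for example with `torch.isfinite` after each layer) may help locate where the non-finite values stop propagating.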
NaNs are very common when training deep learning models.
The usual cause is exploding gradients, so you can try monitoring the gradient norm and applying gradient clipping.
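A minimal sketch of what monitoring and clipping could look like for one training step. The model, the random data, the learning rate, and the `max_norm` threshold are all hypothetical stand-ins, not the setup from the question:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data; the real network will differ.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 6 * 6, 8),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randn(16, 1, 8, 8) for _ in range(3))

max_norm = 1.0  # assumed clipping threshold; tune for your model

optimizer.zero_grad()
loss = loss_fn(model(anchor), model(positive), model(negative))
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the
# total gradient norm measured before clipping, which is handy to log.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

if torch.isfinite(total_norm):
    optimizer.step()
else:
    # Skip the update so NaN/Inf gradients never reach the weights.
    print("non-finite gradient norm, skipping this step")

print(f"gradient norm before clipping: {total_norm.item():.4f}")
```

Logging `total_norm` every step makes it easier to see whether the norm climbs steadily before the weights turn NaN or jumps in a single batch.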
Indeed, I have been trying to check the model weights and gradients. Since the problem seems to be at the level of the first Conv2d in my model, I tried attaching a backward hook to that nn.Conv2d module. I am finding that the hook is called more and more times with every epoch of training (18 times in the first epoch, then several hundred times by epoch 60). Since I update the network at every batch, I expected the hook to be called number-of-batches times per epoch. Is this a bug?
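A minimal sketch for counting hook calls, under some assumptions: the layer, data, and `embed` helper are hypothetical, and the hook is attached with `register_full_backward_hook`. A backward hook generally fires once per forward call of the module that takes part in the backward pass, so with a triplet loss, where the same layer is applied to anchor, positive, and negative, more than one call per batch is plausible. Registering the hook once outside the loop and removing it afterwards also rules out accumulating duplicate hooks:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(1, 4, kernel_size=3)  # hypothetical first layer
calls = {"n": 0}

def count_backward(module, grad_input, grad_output):
    # Only counts invocations; it does not modify the gradients.
    calls["n"] += 1

# Register once, outside the training loop, and keep the handle.
handle = conv.register_full_backward_hook(count_backward)

def embed(x):
    # Hypothetical embedding function: conv output flattened per sample.
    return conv(x).flatten(1)

loss_fn = nn.TripletMarginLoss(margin=1.0)
num_batches = 5

for _ in range(num_batches):
    # requires_grad=True on the inputs is an assumption made so that
    # gradients with respect to the module inputs are computed here.
    a, p, n = (torch.randn(8, 1, 8, 8, requires_grad=True) for _ in range(3))
    loss = loss_fn(embed(a), embed(p), embed(n))
    conv.zero_grad()
    loss.backward()

print(calls["n"], "hook calls for", num_batches, "batches")
handle.remove()  # detach the hook when done
```

In this sketch the layer runs three times per batch, so the count should come out at three per batch rather than one. If the count in a real run keeps growing from epoch to epoch, it may be worth checking whether the hook is registered again inside the loop, or whether the number of forward calls per epoch (for example, the number of mined triplets) changes over time.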