This could be due to something very stupid I did, but I am getting massive loss values.
I am using LogSoftmax + NLLLoss. I have noticed that this massive loss occurs only when I have multiple decoders stacked together. The implementation is a multi-layer convolutional decoder with attention. If I skip the attention computation entirely, I get a more sane loss value, on the order of hundreds.
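For context, the loss is computed in the standard way (a minimal sketch; the shapes here are made up, not my actual dimensions):

```python
import torch
import torch.nn as nn

B, V = 4, 10  # hypothetical batch size and vocabulary size
logits = torch.randn(B, V)            # raw decoder outputs
targets = torch.randint(0, V, (B,))   # gold token indices

log_probs = nn.LogSoftmax(dim=-1)(logits)
loss = nn.NLLLoss()(log_probs, targets)

# Sanity check: this should match CrossEntropyLoss on the raw logits,
# since CrossEntropyLoss = LogSoftmax + NLLLoss.
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(loss, ce))  # True
```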
Attention = Linear + bmm + Softmax
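Roughly, the attention step looks like this (a simplified sketch with invented shapes and names, not my exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: batch B, target length T, source length S, channels C
B, T, S, C = 2, 5, 7, 16

decoder_state = torch.randn(B, T, C)  # output of one decoder layer
encoder_out = torch.randn(B, S, C)    # encoder outputs attended over

proj = nn.Linear(C, C)                # the "Linear" step

# "bmm" step: scores of shape (B, T, S) from projected queries vs. keys
scores = torch.bmm(proj(decoder_state), encoder_out.transpose(1, 2))
attn = F.softmax(scores, dim=-1)      # the "Softmax" step
context = torch.bmm(attn, encoder_out)  # weighted sum of encoder outputs

print(context.shape)  # torch.Size([2, 5, 16])
```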
Though the network does learn and training accuracy increases over time (validation accuracy is not increasing past zero, though), the loss is either zero or exceptionally high.
Any idea what could cause such an issue? (I did not know what other information to provide; please let me know if any other info would be helpful.)