I’m using a transformer and experimenting with alternatives to softmax in the attention, such as sparsemax. With softmax, training runs smoothly, but with the other functions I get NaNs at some point during training. I tried autograd anomaly detection, and it reports:
Function 'BinaryCrossEntropyWithLogitsBackward' returned nan values in its 0th output.
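For reference, this is roughly how I enabled anomaly detection (a minimal sketch; `model`, `loader`, and `optimizer` stand in for my actual setup, and the loss is `BCEWithLogitsLoss` as in the error above):

```python
import torch

# Flag the backward op that produced the NaN
torch.autograd.set_detect_anomaly(True)

criterion = torch.nn.BCEWithLogitsLoss()

for inputs, targets in loader:
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits, targets)
    loss.backward()  # anomaly detection raises here with the traceback above
    optimizer.step()
```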
This doesn’t happen when I use softmax. I’m now running anomaly detection on the attention directly, but training takes about 4x longer. What can be done to pinpoint the source of the NaN?
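By “on the attention directly” I mean something like the sketch below: a forward hook that asserts the attention output is finite. The module path (`model.encoder.layers[i].self_attn`) is a placeholder for my actual model structure.

```python
import torch

def nan_check_hook(module, inputs, output):
    # Some attention modules return (output, attn_weights); check the tensor part
    out = output[0] if isinstance(output, tuple) else output
    if not torch.isfinite(out).all():
        raise RuntimeError(f"non-finite values in output of {module.__class__.__name__}")

for layer in model.encoder.layers:
    layer.self_attn.register_forward_hook(nan_check_hook)
```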