Getting NaN gradients with LSTMCell

We are building a custom LSTM out of LSTMCell for a binary classification task. The loss is BCEWithLogitsLoss.

We traced the problem back to loss.backward(). The loss value itself is not NaN, but the gradients computed from it are.
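
For reference, here is a minimal sketch of how this kind of check can be reproduced. TinyLSTMClassifier, nan_grad_report, and all tensor shapes are hypothetical stand-ins for illustration, not our actual model or data. The sketch only shows the pattern: run one backward pass, confirm the loss is finite, then list the parameters whose gradients contain NaN.

```python
import torch
import torch.nn as nn


class TinyLSTMClassifier(nn.Module):
    """Hypothetical stand-in: an LSTMCell unrolled over time, one logit per sample."""

    def __init__(self, input_size=4, hidden_size=8):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x has shape (seq_len, batch, input_size)
        batch = x.size(1)
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        for x_t in x:  # step through the sequence one time step at a time
            h, c = self.cell(x_t, (h, c))
        return self.out(h).squeeze(1)  # raw logits, shape (batch,)


def nan_grad_report(model):
    """Return the names of parameters whose gradient contains at least one NaN."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and torch.isnan(p.grad).any()
    ]


torch.manual_seed(0)
model = TinyLSTMClassifier()
x = torch.randn(5, 3, 4)           # made-up input: 5 steps, batch of 3
y = torch.tensor([0.0, 1.0, 1.0])  # made-up binary targets

loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()

loss_is_finite = bool(torch.isfinite(loss))
bad_params = nan_grad_report(model)
print("loss finite:", loss_is_finite)
print("parameters with NaN gradients:", bad_params)
```

In our real model the loss is finite but the list of parameters with NaN gradients is not empty. When that happens, wrapping the forward and backward pass in torch.autograd.detect_anomaly() should produce a traceback for the operation whose backward step first generated a NaN.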

Things we’ve tried that didn’t help:

  • PyTorch 3.1, 4.0, 5.0: all have the same problem
  • changing softmax to log_softmax in the forward pass
  • changing the loss to LogSoftmax + NLLLoss
  • initializing the hidden and cell states to non-zero values

Any ideas?! Much appreciated!