Error: Cudnn RNN backward can only be called in training mode

#1

I’m running into the following error:

Cudnn RNN backward can only be called in training mode

The first epoch trains fine, and the error only occurs on the 2nd epoch. I’ve made sure to call model.train() before generating the hidden layer inputs, before calculating the loss, and before loss.backward(), and model.training shows True when I check it. In case it’s relevant, I have to use loss.backward(retain_graph=True).

What else might be causing this error? Are there other objects I need to call .train() on?
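For anyone checking the same thing: model.train() sets the training flag recursively on every submodule, so a quick way to rule out a stray eval-mode layer is to list the modules whose flag is False. The sketch below is only an illustration; the Tagger model and the modules_in_eval_mode helper are hypothetical names, not anything from the original poster's code.

```python
import torch.nn as nn


class Tagger(nn.Module):
    """Hypothetical stand-in for the poster's model (LSTM plus a linear head)."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1])


def modules_in_eval_mode(model):
    """Return the names of all (sub)modules whose training flag is False."""
    return [name or "<root>" for name, m in model.named_modules() if not m.training]


model = Tagger()
model.eval()        # everything switched to eval mode
model.rnn.train()   # simulate only one submodule being switched back
print(modules_in_eval_mode(model))  # ['<root>', 'head']

model.train()       # recursive: flips the root and every submodule
print(modules_in_eval_mode(model))  # []
```

If this list is empty right before loss.backward() and the error still appears, the mode flags are probably not the cause, and it is worth looking at which forward pass the backward call is actually traversing.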

(Jerin Philip) #2

Chances are high that there is a model.eval() call while computing the validation loss between epochs (perhaps triggered by number of updates rather than by epoch), which could be causing this?

#3

In my epoch loop I have train and validate as separate function calls. In the train function I call model.train() at the start, and at the start of the validate function I call model.eval(). I thought this would automatically make the model trainable again. However, when I comment out the validate function (so the model only trains), it actually seems to work.

Do I need to run model.train() somewhere again?
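The setup described above can be sketched roughly as follows. This is a minimal illustration under assumptions: TinyRNN, train_one_epoch, and validate are hypothetical names, and the data is random. The points it tries to show are that validation runs under torch.no_grad() and that only plain Python floats leave each function, so no tensor produced by an eval-mode forward pass can end up in a later training loss.

```python
import torch
import torch.nn as nn


class TinyRNN(nn.Module):
    """Hypothetical small model used only for this sketch."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1])


def train_one_epoch(model, optimizer, x, y):
    model.train()                      # training mode for every submodule
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                    # graph was built in training mode
    optimizer.step()
    return loss.item()                 # plain float, carries no graph


def validate(model, x, y):
    model.eval()                       # eval mode for every submodule
    with torch.no_grad():              # no graph is recorded here at all
        return nn.functional.mse_loss(model(x), y).item()


torch.manual_seed(0)
model = TinyRNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(16, 5, 4), torch.randn(16, 1)   # random placeholder data

history = []
for epoch in range(3):
    train_loss = train_one_epoch(model, optimizer, x, y)
    val_loss = validate(model, x, y)
    history.append((train_loss, val_loss))
```

With this structure, calling model.train() once at the start of the train function is enough. If the error still appears, a likely suspect is a tensor (a hidden state, a tracked loss value) that was created during validation and is later reused in the training loss.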

#4

You should read this link; I think it may solve your problem: https://discuss.pytorch.org/t/cudnn-rnn-backward-can-only-be-called-in-training-mode/37622/2

#5

Thanks yuanzhoulvpi and jerinphilip for your input.

I was using a custom loss function, and as it turns out this was the issue: data I was tracking was not being properly detached, and this somehow prevented the model from being fully put back into training mode.

All fixed now!
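The custom loss itself was not posted, so the following is only a guess at the general pattern, with hypothetical names (TrackingLoss, running). A loss module that stores past loss tensors without detaching them keeps their autograd graphs alive; if such a tensor was produced while the model was in eval mode and later feeds into a training loss, backward would pass through an RNN forward pass that ran in eval mode. Storing a detached value avoids that, and in this sketch it also removes any need for retain_graph=True.

```python
import torch
import torch.nn as nn


class TrackingLoss(nn.Module):
    """Hypothetical custom loss that also tracks a running average of itself."""

    def __init__(self, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.running = None  # tracked statistic, must never hold a graph

    def forward(self, pred, target):
        loss = nn.functional.mse_loss(pred, target)
        # detach() keeps the value but drops the link to the autograd graph,
        # so the tracked statistic cannot pull old forward passes into backward.
        value = loss.detach()
        if self.running is None:
            self.running = value
        else:
            self.running = self.momentum * self.running + (1 - self.momentum) * value
        return loss


torch.manual_seed(0)
model = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
criterion = TrackingLoss()
x = torch.randn(2, 5, 4)        # random placeholder batch
target = torch.randn(2, 5, 8)

for step in range(2):
    model.zero_grad()
    out, _ = model(x)
    loss = criterion(out, target)
    loss.backward()             # works without retain_graph=True
```

Needing retain_graph=True is often a hint that some stored tensor is still attached to an earlier graph, so it can be worth checking grad_fn on anything kept between iterations.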