cudnn RNN backward can only be called in training mode
The first epoch trains fine, and the error only occurs on the second epoch. I’ve made sure to call model.train() before generating the hidden-layer inputs, calculating the loss, and calling loss.backward(), and model.training reports True when I check it. If it’s relevant, I have to use loss.backward(retain_graph=True).
What else might be causing this error? Are there other modules I need to make sure I call .train() on?
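For reference, here's a minimal sketch of the pattern I'm describing (heavily simplified; the model, data, and loss are placeholders for my real code, and it needs a CUDA build of PyTorch to hit the cuDNN path):

```python
import torch
import torch.nn as nn

# Minimal stand-in for my setup: a cuDNN-backed LSTM and one training step.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()                                 # model.training reports True here
inputs = torch.randn(4, 10, 8, device="cuda")
hidden = (torch.zeros(1, 4, 16, device="cuda"),   # initial hidden/cell states
          torch.zeros(1, 4, 16, device="cuda"))

output, hidden = model(inputs, hidden)
loss = output.pow(2).mean()                   # placeholder for my custom loss
loss.backward(retain_graph=True)              # I need retain_graph=True
optimizer.step()
optimizer.zero_grad()
```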
Chances are high that there is a model.eval() call running in between epochs while the validation loss is computed (perhaps triggered per update rather than per epoch). Could that be causing this?
In my epoch loop I have train and validate as separate function calls. In the train function I call model.train() at the start, and at the start of the validate function I call model.eval(). I thought this would automatically handle putting the model back into training mode; however, when I comment out the validate function (so the model only trains), it actually works.
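For context, the loop looks roughly like this (a simplified skeleton with placeholder data; my real functions also handle batching and metrics). One thing worth noting: if validation runs entirely under torch.no_grad(), no autograd graph is built in eval mode at all, so a later backward() shouldn't even be able to reach an eval-mode cuDNN RNN forward:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_one_epoch():
    model.train()                        # switch back to training mode
    x = torch.randn(4, 10, 8, device="cuda")
    out, _ = model(x)
    loss = out.pow(2).mean()             # placeholder loss
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()

def validate():
    model.eval()                         # eval mode for validation
    with torch.no_grad():                # no autograd graph is recorded here
        x = torch.randn(4, 10, 8, device="cuda")
        out, _ = model(x)
        return out.pow(2).mean().item()

for epoch in range(2):
    train_one_epoch()
    val_loss = validate()                # commenting this out avoided the error
```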
Thanks yuanzhoulvpi and jerinphilip for your input.
I was using a custom loss function, and as it turns out this was the issue: tensors I was tracking were not being properly detached. My understanding is that, combined with retain_graph=True, this let a later backward() reach graph segments that had been built while the model was in eval mode during validation, where the cuDNN RNN forward had run in eval mode, which is exactly what the error message complains about. Detaching the tracked tensors fixed it.
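In case it helps anyone else, here's a hedged reconstruction of the kind of bug I had (the class and names are made up; my real loss is more involved):

```python
import torch
import torch.nn as nn

class TrackingLoss(nn.Module):
    """Made-up stand-in for my custom loss: an MSE term that also tracks a
    running baseline of past loss values and folds it into the result."""

    def __init__(self):
        super().__init__()
        self.baseline = None

    def forward(self, output, target):
        mse = (output - target).pow(2).mean()
        if self.baseline is not None:
            # Penalize drift away from the tracked baseline.
            mse = mse + 0.1 * (mse - self.baseline).abs()
        # BUG (what I had): storing the live tensor kept it attached to the
        # autograd graph it came from, including graphs built under
        # model.eval() during validation. The next training backward() then
        # traversed an eval-mode cuDNN RNN forward and raised the error.
        #   self.baseline = mse
        # FIX: detach before tracking, so no gradients flow into history.
        self.baseline = mse.detach()
        return mse

criterion = TrackingLoss()
```

With the detach in place, the tracked value is just a number as far as autograd is concerned, so backward() never wanders into graphs built during validation, and the second epoch trains without the error.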