PyTorch cudnn RNN backward can only be called in training mode

The error is: ‘cudnn RNN backward can only be called in training mode.’

I’m doing DDPG with an LSTM network. Part of the code is shown below.
It works fine on the CPU, but when run on the GPU it reports this error.
But! As you can see, I’ve already set the critic network to train mode, so why does it still report the error literally four lines after that??

    self.target_critic.eval()
    self.critic.eval()

    target_actions=self.target_actor.forward(new_state)
    critic_value_=self.target_critic.forward(new_state,target_actions)
    critic_value=self.critic.forward(state,action)

    target=[]
    for j in range(self.batch_size):
        target.append(reward[j]+self.gamma*critic_value_[j]*done[j])
    target=T.tensor(target).to(self.critic.device)
    target=target.view(self.batch_size,1)
    
    self.critic.train()
    self.critic.optimizer.zero_grad()
    critic_loss=F.mse_loss(target,critic_value)
    critic_loss.backward()
    self.critic.optimizer.step()

I checked the .training boolean value and confirmed that all my networks are in training mode. Yet it still reports the error.

When you call .eval() on a model… you’re telling the model that it isn’t going to be trained anymore… as I understand it, anyway.

I am not sure you can switch from train mode to eval mode the way your code is demonstrating… I’d follow the lead of most examples and separate the two loops entirely, even into separate files.

That’s bad… so based on what you said, it’s impossible to use an RNN in such an application (DDPG)? I couldn’t come up with an alternative…

I would look at a basic RNN example before you conclude that…

I’m no expert, just chiming in regarding .eval().

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

You could either:

  • call .train() on the RNN module after using model.eval(),
  • call .eval() only on the necessary modules,
  • or disable cudnn via torch.backends.cudnn.enabled = False.
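
In case it helps, here is a minimal sketch of the first and third options (the module names, sizes and tensors below are made up purely for illustration). As far as I understand, what matters is the mode the RNN was in during the forward pass you later backpropagate through, not the mode at the moment backward() is called:

    import torch
    import torch.nn as nn

    # toy LSTM on the GPU; sizes are arbitrary
    rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).cuda()
    x = torch.randn(4, 5, 8, device='cuda')

    # Option 1: evaluate in eval mode, but switch back to train mode
    # *before* the forward pass you intend to backpropagate through.
    rnn.eval()
    with torch.no_grad():
        eval_out, _ = rnn(x)        # evaluation-only forward, no grad needed

    rnn.train()                     # back to train mode
    out, _ = rnn(x)                 # this forward runs with cudnn in train mode
    out.sum().backward()            # backward now works

    # Option 3: disable cudnn so the native PyTorch RNN implementation is
    # used instead (slower, but without this restriction).
    torch.backends.cudnn.enabled = False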

But I did call .train() before the backward line. Doesn’t it count?

Hi all

I’m facing the exact same issue in PyTorch when training a model with reinforcement learning. It works fine on the CPU, but throws this error when training on the GPU.

cudnn RNN backward can only be called in training mode

I have also set my model back to train() mode after calling eval() once. I have also tried setting torch.backends.cudnn.enabled = False. The error still remains. Is there any update on this issue? What is the solution?

Disabling cudnn should work, so could you recheck it?
The suggested solutions in the previous post all point towards either leaving the cudnn RNN in training mode or disabling cudnn (for this layer).


I have rechecked it - disabling cudnn did not work for me. However, leaving the model in train() mode rather than switching between train() and eval() did the trick. This trick only had to be applied when training the model on the GPU - on the CPU the switch from eval() to train() works perfectly fine. Thanks for the suggestion!
One question though - I tried disabling cudnn just before calling the backward() function. Is that the reason it didn’t work? Should I disable it at the point where the training process begins?

Yes, that’s the reason, as you should disable it for both the forward and the backward pass of this layer.
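
For anyone who hits the same thing, here is a rough sketch of where the cudnn switch has to go, assuming a generic training step (model, optimizer, x and y are placeholder names, not from the code above). Disabling cudnn only right before backward() is too late, because the forward pass has already gone through the cudnn kernels; either set the flag once before the training loop, or (if I remember correctly) use the torch.backends.cudnn.flags context manager so that only this part runs without cudnn:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # placeholder model/optimizer/data, standing in for the real training step
    model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).cuda()
    optimizer = torch.optim.Adam(model.parameters())
    x = torch.randn(4, 5, 8, device='cuda')
    y = torch.randn(4, 5, 16, device='cuda')

    # Too late: by this point the forward has already run through cudnn,
    # so flipping the flag here does not change which backward gets called.
    #   out, _ = model(x)
    #   loss = F.mse_loss(out, y)
    #   torch.backends.cudnn.enabled = False
    #   loss.backward()

    # Disable cudnn before the forward pass, so both the forward and the
    # backward of the RNN use the native implementation.
    with torch.backends.cudnn.flags(enabled=False):
        out, _ = model(x)
        loss = F.mse_loss(out, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Setting torch.backends.cudnn.enabled = False once at the top of the script does the same thing globally, at the cost of losing cudnn speedups everywhere else.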