The error is: 'cudnn RNN backward can only be called in training mode.'
I'm doing DDPG with an LSTM network; part of the code is shown below.
It works fine on the CPU, but when run on the GPU it reports this error.
But, as you can see, I've already set the critic network back to train mode, so why does it still report the error just a few lines after that?
self.target_critic.eval()
self.critic.eval()
target_actions = self.target_actor.forward(new_state)
critic_value_ = self.target_critic.forward(new_state, target_actions)
critic_value = self.critic.forward(state, action)
target = []
for j in range(self.batch_size):
    target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
target = T.tensor(target).to(self.critic.device)
target = target.view(self.batch_size, 1)
self.critic.train()  # critic is put back into train mode here
self.critic.optimizer.zero_grad()
critic_loss = F.mse_loss(target, critic_value)
critic_loss.backward()  # <-- the error is raised here on the GPU
self.critic.optimizer.step()
I checked the .training boolean on all my networks and confirmed they are in training mode, yet it still reports the error.
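For reference, this seems reproducible with just a bare LSTM on the GPU: if the forward pass runs while the module is in eval() mode, calling train() afterwards before backward() doesn't help, because the backward belongs to the forward that was already recorded in eval mode. A minimal standalone sketch of what I mean (my own repro attempt, not taken from the DDPG code above):

import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16).cuda()
x = torch.randn(5, 3, 8, device='cuda', requires_grad=True)

rnn.eval()       # forward pass runs with cuDNN in inference mode
out, _ = rnn(x)
loss = out.sum()

rnn.train()      # switching back afterwards does not help
loss.backward()  # RuntimeError: cudnn RNN backward can only be called in training mode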
When you call .eval() on a model, you're telling it that it's not going to be trained anymore, as I understand it anyway.
I'm not sure you can switch back and forth between eval and train mode the way your code does; I'd follow the lead of most examples and keep the training and evaluation loops entirely separate.
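If the goal of the eval() calls is just to keep gradients out of the target computation, one option is to leave everything in train() mode and wrap the target pass in torch.no_grad() instead of toggling modes. A rough restructuring of the posted snippet along those lines (untested, just a sketch using the same names):

with T.no_grad():  # no gradients are needed through the target networks
    target_actions = self.target_actor.forward(new_state)
    critic_value_ = self.target_critic.forward(new_state, target_actions)

critic_value = self.critic.forward(state, action)  # critic stays in train() mode

target = []
for j in range(self.batch_size):
    target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
target = T.tensor(target).to(self.critic.device).view(self.batch_size, 1)

self.critic.optimizer.zero_grad()
critic_loss = F.mse_loss(critic_value, target)
critic_loss.backward()
self.critic.optimizer.step()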
I'm facing the exact same issue in PyTorch when training a model with reinforcement learning. It works fine on the CPU, but throws this error when training on the GPU:
cudnn RNN backward can only be called in training mode
I have also set my model back to train() mode after calling eval() once, and I have tried setting torch.backends.cudnn.enabled=False. The error still remains. Is there any update on this issue? What is the solution?
Disabling cudnn should work, so could you recheck it?
The suggested solutions are given in the previous post, and they all point towards either leaving the cudnn RNN in training mode or disabling cudnn (for this layer).
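For the second option, note that the flag has to be off while the RNN forward pass runs, since the backward reuses whatever implementation the forward recorded. One way to do that locally is the torch.backends.cudnn.flags context manager around the forward passes, roughly like this (a sketch dropped into the update function from the first post, not tested):

with torch.backends.cudnn.flags(enabled=False):
    # the LSTM forwards run without cuDNN here, so the recorded backward
    # is the native implementation and is not affected by eval()/train() switches
    critic_value_ = self.target_critic.forward(new_state, target_actions)
    critic_value = self.critic.forward(state, action)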
I have rechecked it - disabling cudnn did not work for me. However, leaving the model in train() mode rather than switching between train() and eval() did the trick. This only had to be done when training on the GPU; on the CPU the switch from eval() to train() works perfectly fine. Thanks for the suggestion!
One question though - I tried disabling cudnn just before calling backward(). Is that the reason it didn't work? Should I disable it at the point where the training process begins instead?