Unable to set model to training mode

I’m trying to train a very simple fully connected (FC) neural network. My machine has 2 GPUs, and I followed this tutorial to make the code use both of them: Data Parallel

For some reason I’m unable to train the model. The error I keep getting is:

RuntimeError: cudnn RNN backward can only be called in training mode

The solution seemed trivial: set the model to training mode before the forward call. But that doesn’t fix the issue at all. I tried several different ways of setting the model to train mode, and none of them worked.

Here’s my code:


import torch
import torch.nn as nn

class ABC(nn.Module):
    def __init__(self, inp_dim_size, hid_dim_size, out_size):
        super(ABC, self).__init__()
        self.inp_dim_size = inp_dim_size
        self.hid_dim_size = hid_dim_size
        self.out_size = out_size
        # Stack of fully connected layers: inp -> hid -> hid/2 -> hid/2 -> out
        self.seq_layer = nn.Sequential(
            nn.Linear(self.inp_dim_size, self.hid_dim_size),
            nn.Linear(self.hid_dim_size, self.hid_dim_size // 2),
            nn.Linear(self.hid_dim_size // 2, self.hid_dim_size // 2),
            nn.Linear(self.hid_dim_size // 2, self.out_size)
        )

    def forward(self, X_batch):
        output_scores = self.seq_layer(X_batch)
        return output_scores

Train code:

for epoch in range(num_epochs):  # loop over the dataset multiple times

    rl, ns = 0.0, 0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs = inputs.to(device)
        scores = labels.to(device)
        br, _, _ = scores.shape

        # zero the parameter gradients

        # forward + backward + optimize
        predictions = model(inputs)
        loss = criterion(predictions, scores.view(br, 1))
        loss.backward(retain_graph=True)


model = ABC(103, 51, 1)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    model = model.train()
    model = model.to(device)
    model = model.train()

It’s been pretty frustrating trying to solve a seemingly easy issue without any results. Any input would be highly appreciated. TIA!

Are you sure you are running this code? It doesn’t use any RNNs, which is what the error message points to.

I didn’t get the last part of your message.

The error message:

RuntimeError: cudnn RNN backward can only be called in training mode

claims that an RNN module is being run in eval mode while backward() is called on its output somewhere.
However, your ABC model doesn’t use any RNNs, so it seems that either some code is missing from your post or the error is raised from some other part of your code.
If you are using a Jupyter notebook, make sure to restart the notebook.
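One way to narrow down where an unexpected RNN enters the picture is to check whether the tensors coming out of the data loader still carry autograd history. The snippet below is only a rough sketch: describe_autograd_state is a hypothetical helper (not a PyTorch function), and the tensors are random stand-ins for a real batch.

```python
import torch

def describe_autograd_state(t):
    """Hypothetical helper: summarise how a tensor relates to the autograd graph."""
    return {
        "requires_grad": t.requires_grad,
        "is_leaf": t.is_leaf,
        # grad_fn is None for tensors that were not produced by a tracked operation
        "grad_fn": None if t.grad_fn is None else type(t.grad_fn).__name__,
    }

# Stand-in for a plain input batch loaded from disk: no autograd history.
plain = torch.randn(4, 103)

# Stand-in for a batch computed by another model whose weights require grad.
weight = torch.randn(103, 103, requires_grad=True)
derived = plain @ weight

plain_state = describe_autograd_state(plain)
derived_state = describe_autograd_state(derived)
print(plain_state)
print(derived_state)
```

If a batch taken from trainloader reports requires_grad as True with a non-empty grad_fn, calling backward() on the loss will also run backward through whatever produced that batch.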

ok, will make sure and let you know. Thanks

I think I figured out the issue. The data I was passing to my ABC network was generated by a saved RNN model, and I forgot to call .detach() on it. As a result, the input data still had requires_grad=True, which it shouldn’t, so backward() tried to propagate into that RNN. Calling .detach() on the data fixed the issue. Thanks for the suggestions, @ptrblck.
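For anyone hitting the same thing, here is a minimal CPU sketch of the fix. The LSTM, its sizes, and the single nn.Linear head are made up for illustration and are not my actual models. Note that the cuDNN error itself only shows up on a GPU; on CPU this just shows how .detach() cuts the features off from the RNN's graph.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the saved RNN that produced the features.
rnn = nn.LSTM(input_size=8, hidden_size=103, batch_first=True)
rnn.eval()  # feature extractor kept in eval mode

x = torch.randn(4, 5, 8)  # (batch, seq_len, features)

out, _ = rnn(x)
features = out[:, -1, :]            # last time step, shape (4, 103)
attached = features.requires_grad   # still linked to the RNN's graph

# Cut the link to the RNN so backward() stops at the features.
features = features.detach()

# Stand-in for the downstream fully connected model.
head = nn.Linear(103, 1)
loss = nn.functional.mse_loss(head(features), torch.zeros(4, 1))
loss.backward()

# Gradients reach the head, but none flow back into the RNN.
rnn_grads = [p.grad for p in rnn.parameters()]
```

Wrapping the RNN forward pass in torch.no_grad() when generating the data should have the same effect, since no graph is recorded in the first place.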

Good to hear you’ve figured it out! :slight_smile: