Unable to set model to training mode

Amith_Adiraju · August 21, 2019, 8:29pm

I’m trying to train a very simple fully connected neural network. I have 2 GPUs on my machine, and I followed this tutorial to make the code use both of them: Data Parallel

For some reason, I’m unable to train the model, the error I keep getting is

RuntimeError: cudnn RNN backward can only be called in training mode

The solution seemed trivial: set the model to training mode before the forward call. But that doesn’t fix the issue at all. I tried many different ways of putting the model in train mode, and none of them worked.

Here’s my code:

Class

import torch
import torch.nn as nn


class ABC(nn.Module):
    def __init__(self, inp_dim_size, hid_dim_size, out_size):
        super(ABC, self).__init__()
        self.inp_dim_size = inp_dim_size
        self.hid_dim_size = hid_dim_size
        self.out_size = out_size

        # Fully connected stack: three hidden layers with ELU and dropout,
        # followed by a linear output layer.
        self.seq_layer = nn.Sequential(
            nn.Linear(self.inp_dim_size, self.hid_dim_size),
            nn.ELU(),
            nn.Dropout(0.4),
            nn.Linear(self.hid_dim_size, self.hid_dim_size // 2),
            nn.ELU(),
            nn.Dropout(0.4),
            nn.Linear(self.hid_dim_size // 2, self.hid_dim_size // 2),
            nn.ELU(),
            nn.Dropout(0.3),
            nn.Linear(self.hid_dim_size // 2, self.out_size),
        )

    def forward(self, X_batch):
        output_scores = self.seq_layer(X_batch)
        return output_scores

Train code:

for epoch in range(num_epochs):  # loop over the dataset multiple times
    rl, ns = 0.0, 0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        inputs = inputs.to(device)
        scores = labels.to(device)

        br, _, _ = scores.shape

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        predictions = model(inputs)
        loss = criterion(predictions, scores.view(br, 1))
        loss.backward(retain_graph=True)
        optimizer.step()

        ...



model = ABC(103, 51, 1)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    model.to(device)
    model = model.train()
else:
    model = model.to(device)
    model = model.train()

It’s been pretty frustrating trying to solve a seemingly easy issue without any results. Any input would be highly appreciated. TIA!

ptrblck · August 21, 2019, 10:05pm

Are you sure this is the code you are running? It doesn’t use any RNNs, yet the error message points to one.

Amith_Adiraju · August 21, 2019, 10:07pm

I didn’t get the last part of your message.

ptrblck · August 21, 2019, 10:09pm

The error message:

RuntimeError: cudnn RNN backward can only be called in training mode

claims you are running an RNN module outside of training mode and then calling backward() somewhere.
However, your ABC model doesn’t use any RNNs, so it seems either some part of your code is missing from the post or the error is raised by some other code.
If you are using a Jupyter notebook, make sure to restart it so that no stale state is left over.
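To check whether any module really is outside training mode, one option is to list the submodules whose `training` flag is False. The following is only a sketch: the helper name `modules_in_eval_mode` and the small `nn.Sequential` model are made up for illustration and are not part of the code above.

```python
import torch.nn as nn


def modules_in_eval_mode(model):
    """Return the names of all submodules whose .training flag is False."""
    return [name or "<root>" for name, m in model.named_modules() if not m.training]


# Hypothetical stand-in model, used only to demonstrate the helper.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.4), nn.Linear(4, 1))

model.train()
all_in_train = modules_in_eval_mode(model)  # empty list: everything is in train mode

model[1].eval()  # switch only the dropout layer to eval mode
only_dropout = modules_in_eval_mode(model)  # ['1']
```

If this list is empty for the model being trained, the error most likely originates from a different module that is still part of the autograd graph.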

Amith_Adiraju · August 21, 2019, 10:16pm

OK, I will check and let you know. Thanks.

Amith_Adiraju · August 22, 2019, 10:28pm

I think I figured out the issue. The data I was passing to my ABC network was generated by a saved RNN model, and I forgot to call .detach() on it. As a result, the input to my network had requires_grad=True, which it shouldn’t, so backward() tried to propagate into the RNN as well. Calling .detach() on that data fixed the issue. Thanks for the suggestions, @ptrblck.
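For anyone hitting the same error, the fix can be sketched roughly as follows. This is an illustration under assumptions, not my actual code: the `nn.GRU` stands in for the saved RNN, and the names `rnn`, `features`, `detached`, and `downstream` are hypothetical. It runs on CPU, where the cuDNN error itself does not appear, so it only shows that the RNN output stays attached to the RNN's graph until `.detach()` is called.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical upstream RNN standing in for the saved model that produced the data.
rnn = nn.GRU(input_size=8, hidden_size=103, batch_first=True)
rnn.eval()  # eval mode does NOT disable gradient tracking

x = torch.randn(4, 5, 8)
out, _ = rnn(x)
features = out[:, -1, :]  # last time step, shape (4, 103)

# The features are still connected to the RNN's parameters.
print(features.requires_grad)  # True

# Cut the connection before feeding the data to another network.
detached = features.detach()
print(detached.requires_grad)  # False

# Hypothetical downstream layer standing in for the ABC network.
downstream = nn.Linear(103, 1)
loss = downstream(detached).mean()
loss.backward()  # only reaches `downstream`, never the RNN

rnn_grads_empty = all(p.grad is None for p in rnn.parameters())
```

Wrapping the feature-generation step in `torch.no_grad()` has a similar effect, since tensors produced inside that context are not tracked by autograd.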

ptrblck · August 22, 2019, 10:29pm

Good to hear you’ve figured it out!