Training cascaded separate models

Hi,
I am a newbie who has been working on simple RNN modelling, and now I am designing a system that has two models (model_A, model_B), both RNN models.
After trying for weeks to solve my issue, I am finally asking you guys for help to make it run.
My model is intended to work as follows:

  1. model_A is trained with input x and output y … no problem.
  2. When done, model_A is set to eval mode via model_A.eval().
  3. model_B is defined and is to be trained in the following way:
    input x -> model_B -> z (as interim output) -> model_A -> y as output
  4. Then an error is shown when loss_B.backward() is called: “cudnn RNN backward can only be called in training mode”.
    It seems like model_A is blocking the gradient computation, but I have no idea how to work around it.

Any ideas to solve this issue would be appreciated…
Thank you.

You could either call .train() on the nn.RNN, since cudnn needs this layer in training mode to calculate the gradients (and, if necessary, disable dropout in this layer via model.rnn.dropout = 0.0), or disable cudnn for this layer via:

x = ...
with torch.backends.cudnn.flags(enabled=False):
    x = self.rnn(x)
x = ...
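
If model_A's weights should also stay fixed while gradients still flow through it, a minimal sketch of the first option could look like this (assuming the recurrent layer is exposed as an attribute called rnn, as above):

model_A.eval()                     # eval behavior for the rest of the model
model_A.rnn.train()                # cudnn RNN backward needs training mode
model_A.rnn.dropout = 0.0          # avoid dropout noise in the frozen layer
for param in model_A.parameters():
    param.requires_grad = False    # gradients still flow through model_A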

Thanks for the comment. As you suggested, I put the code in forward():

...
    x_rnn_input = x.view(batch_size, -1, rnn_input_size)
    with torch.backends.cudnn.flags(enabled=False):
        output_rnn, hidden_rnn = self.rnn_layer(x_rnn_input)
...

but the same error is raised (and yes, dropout = 0 is set as well).

The approach seems to work here.
Could you post an executable code snippet, which would reproduce the issue you are seeing?

Not using torch.no_grad() during eval???

If model_A shouldn’t be trained, this would be the right approach.
However, based on the initial description of the issue I understood that model_A should be further trained in eval() mode.

CC @superco to clarify the use case. :slight_smile:
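
For context, a minimal sketch of the difference between the two approaches, using two hypothetical linear models standing in for model_A and model_B:

import torch

frozen = torch.nn.Linear(4, 4)     # stands in for model_A
upstream = torch.nn.Linear(4, 4)   # stands in for model_B
x = torch.randn(2, 4)

# torch.no_grad(): no graph is recorded, so nothing upstream can be trained
with torch.no_grad():
    y = frozen(upstream(x))
print(y.requires_grad)  # False, so y.sum().backward() would fail here

# requires_grad=False on the frozen parameters: the graph is still recorded,
# so gradients flow back to `upstream` while `frozen` receives no gradients
for p in frozen.parameters():
    p.requires_grad = False
y = frozen(upstream(x))
y.sum().backward()
print(upstream.weight.grad is not None)  # True
print(frozen.weight.grad)                # None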

I am sorry for the long delay… I encountered a weird situation when simplifying my code for this thread. A trained model (Model_A) is used in a cascade chain to train Model_B.

Q1. My question was: how can I set Model_A not to be trained while training Model_B, while keeping it within the data flow? That is, it should still support backpropagation for Model_B's training.
Q2. Now I have encountered a new situation. When the device is set to 'cpu', no error is shown. I have not checked if it actually works well, but at least it runs without an error.
However, when the device is set to 'cuda', it shows the error I originally asked about (“cudnn RNN backward can only be called in training mode”)… Of course I have a GPU device.

Please check out my working code and give me any suggestions:


import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
 
class Net_1(nn.Module):

    def __init__(self, emb_input_size, rnn_input_size, hidden_size, num_layers, dropout_prob, batch_first=True):
        super(Net_1, self).__init__()
        # emb_input_size and dropout_prob are currently unused
        self.rnn_input_size = rnn_input_size
        self.rnn_layer = nn.RNN(rnn_input_size, hidden_size, num_layers, batch_first=batch_first)
        self.linear1 = nn.Linear(hidden_size, rnn_input_size)  # project the hidden state back to the input size

    def forward(self, x):
        batch_size = 50  # reshape the (1, 1000) stream into 50 sequences
        x_rnn_input = x.view(batch_size, -1, self.rnn_input_size)
        output_rnn, hidden_rnn = self.rnn_layer(x_rnn_input)
        output = self.linear1(output_rnn)
        return output.view(x.shape[0], -1)  # maintain the same shape as the input
#####################
# Hyper-parameters for Model_A
#####################
hyper_parameters = [1, 10**2, 1e-2, 8, 3, 0.1]
[num_datasets, num_epochs,learning_rate,hidden_size,num_hidden_layers,dropout_prob] = hyper_parameters
emb_input_size = 0
rnn_input_size = 2

#####################
# Setup Model for Model_A
#####################
Model_A = Net_1(emb_input_size, rnn_input_size, hidden_size, num_hidden_layers, dropout_prob, batch_first=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = "cpu"  # forced to CPU here; setting "cuda" reproduces the error below
Model_A.to(device)
#####################
# Loss Model for Model_A
#####################
loss_func_A = nn.MSELoss(reduction='sum')
optimizer_A = optim.Adam(params = Model_A.parameters(), lr = learning_rate)
#####################
# Data stream
#####################
x_data = np.random.randn(1, 1000)
y_data = x_data - 0.01 * x_data**3
x_chunk = torch.tensor(x_data, dtype=torch.float).to(device)   # x_data and y_data are numpy arrays of shape [1, 1000]
y_chunk = torch.tensor(y_data, dtype=torch.float).to(device)

#####################
# Training Model_A
#####################
for step2 in range(num_epochs):
    optimizer_A.zero_grad()
    output_A = Model_A(x_chunk) 
    loss_A = loss_func_A(output_A.view(x_chunk.shape[0],-1), y_chunk.view(x_chunk.shape[0],-1))    # compare total output with the target
    loss_A.backward()
    optimizer_A.step()
        
Model_A.eval()    # evaluation mode
        
#####################
# Hyper-parameters for Model_B
#####################
emb_input_size = 0
rnn_input_size = 2
hyper_parameters = [1, 1*(10**2), 1e-2, 24, 2, 0.1]  
[num_datasets, num_epochs,learning_rate,hidden_size,num_hidden_layers,dropout_prob] = hyper_parameters
#####################
# Setup Model B
#####################
Model_B = Net_1(emb_input_size, rnn_input_size, hidden_size, num_hidden_layers, dropout_prob, batch_first=True)
Model_B = Model_B.to(device)
Model_B.train()  # set Model_B for training
            
loss_func_B = nn.MSELoss(reduction='sum')
optimizer_B = optim.Adam(params = Model_B.parameters(), lr = learning_rate)
#####################
# Training for Model B
#####################
for step in range(num_epochs):
    optimizer_B.zero_grad()
    z_A = Model_B(x_chunk)      # x -> Model_B -> z_A -> Model_A
    output_A = Model_A(z_A)     # get Model_A's output
    loss_B = loss_func_B(output_A.view(x_chunk.shape[0],-1), x_chunk.view(x_chunk.shape[0],-1))    # compare the cascaded output with the target
    loss_B.backward()
    optimizer_B.step()
Model_A.eval()    
Model_B.eval()   # stop training

print("#--- Program End ---#")

Based on your description it seems you don’t want to train modelA, as @shartzog suggested.
In that case, you could either wrap the forward pass of modelA in a with torch.no_grad() block or set the .requires_grad attribute of all parameters of modelA to False.
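
Applied to the posted training loop, a sketch could look like this (the cudnn context manager is only needed on a CUDA device, since the eval-mode cudnn RNN cannot be backpropagated through):

for p in Model_A.parameters():
    p.requires_grad = False            # freeze Model_A's weights

for step in range(num_epochs):
    optimizer_B.zero_grad()
    z_A = Model_B(x_chunk)
    # disable cudnn so the eval-mode RNN inside Model_A supports backward
    with torch.backends.cudnn.flags(enabled=False):
        output_A = Model_A(z_A)
    loss_B = loss_func_B(output_A.view(x_chunk.shape[0], -1), x_chunk.view(x_chunk.shape[0], -1))
    loss_B.backward()                  # gradients reach Model_B through the frozen Model_A
    optimizer_B.step()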


Hi, I want two-step training:
step 1: train only model_A
step 2: train model_B in connection to model_A, which is in eval mode

- Do you think I still need to wrap the forward pass for step 2? How can I wrap it only for step 2…?

- And how should I understand the fact that the error occurs only when device = 'cuda'?

  1. Yes, you should still disable the gradient calculation using one of the mentioned approaches.
    If your training consists of two steps, I guess you could use two different training methods or just add the second step after the first training run is done.

  2. This error is raised because the cudnn RNN implementation cannot calculate gradients in evaluation mode. Please refer to the previous posts for more information.
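
For reference, a minimal snippet that reproduces the error on a CUDA device; the same code runs on the CPU without raising, since cudnn is not used there:

import torch

rnn = torch.nn.RNN(2, 8, batch_first=True).cuda()
rnn.eval()  # eval mode + cudnn triggers the error on backward
x = torch.randn(4, 10, 2, device="cuda")
out, _ = rnn(x)
# out.sum().backward()  # RuntimeError: cudnn RNN backward can only be called in training mode

# the same backward succeeds once cudnn is disabled for the forward pass
with torch.backends.cudnn.flags(enabled=False):
    out, _ = rnn(x)
out.sum().backward()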