RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time

I took the LSTM model from this site and found that self.hidden_cell is causing the problem.

class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_layer_size=100, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size

        self.lstm = nn.LSTM(input_size, hidden_layer_size)

        self.linear = nn.Linear(hidden_layer_size, output_size)

        self.hidden_cell = (torch.zeros(1,1,self.hidden_layer_size),
                            torch.zeros(1,1,self.hidden_layer_size))

    def forward(self, input_seq):
        lstm_out, self.hidden_cell = self.lstm(input_seq.view(len(input_seq), 1, -1), self.hidden_cell)
        predictions = self.linear(lstm_out.view(len(input_seq), -1))
        return predictions[-1]

So I removed it with:

lstm_out, _ = self.lstm(input_seq.view(len(input_seq), 1, -1))
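
Another option that also avoids the error, if the hidden state should be kept between batches instead of being dropped, is to detach it inside the training loop so backward() never reaches the previous iteration's graph. A minimal sketch (train_loader, optimizer and loss_function are placeholder names, not from the tutorial):

for seq, labels in train_loader:
    optimizer.zero_grad()

    # keep the values of the hidden state, but cut it off from the old graph
    model.hidden_cell = tuple(h.detach() for h in model.hidden_cell)

    y_pred = model(seq)
    loss = loss_function(y_pred, labels)
    loss.backward()
    optimizer.step()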

I am getting the same error that most people here are getting:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

Here is my code:

class RNN(nn.Module):
    
    def __init__(self,input_size, output_size, hidden_size=64):

        super().__init__()

        self.input_size  = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.xh = nn.Linear(self.input_size, self.hidden_size, bias=False)
        self.hh = nn.Linear(self.hidden_size, self.hidden_size)
        self.hy = nn.Linear(self.hidden_size, self.output_size)
        
        
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim=1)
        self.sigmoid = nn.Sigmoid()

    def rnn_cell(self, x, prev_h):
        # one recurrence step: new hidden state from the input and the previous hidden state
        act = self.xh(x) + self.hh(prev_h)
        h = self.tanh(act)

        updated_c = self.sigmoid(self.hy(h))

        return updated_c, h


    def forward(self, inp, h):
        return self.rnn_cell(inp, h)

@ptrblck Hi,

I got the same error, and I have looked through these solutions. However, I still don't know how to solve my problem.
Here is my code:

    for iter, input in enumerate(train_loader):
        template = input['template']            #read input
        search = input['search']
        label_cls = input['out_label']
        reg_label = input['reg_label']
        reg_weight = input['reg_weight']

        cfg_cnn = [(2, 16, 2, 0, 3),
                   (16, 32, 2, 0, 3),
                   (32, 64, 2, 0, 3),
                   (64, 128, 1, 1, 3),
                   (128, 256, 1, 1, 3)]
        cfg_kernel = [127, 63, 31, 31, 31]
        cfg_kernel_first = [63,31,15,15,15]

        c1_m = c1_s = torch.zeros(1, cfg_cnn[0][1], cfg_kernel[0], cfg_kernel[0]).to(device)
        c2_m = c2_s = torch.zeros(1, cfg_cnn[1][1], cfg_kernel[1], cfg_kernel[1]).to(device)
        c3_m = c3_s = torch.zeros(1, cfg_cnn[2][1], cfg_kernel[2], cfg_kernel[2]).to(device)
        trans_snn = [c1_m, c1_s, c2_m, c2_s, c3_m, c3_s]          # use this list

        for i in range(search.shape[-1]):
            cls_loss_ori, cls_loss_align, reg_loss, trans_snn = model(
                template.squeeze(-1), search[:, :, :, :, i], trans_snn,
                label_cls[:, :, :, i],
                reg_target=reg_label[:, :, :, :, i], reg_weight=reg_weight[:, :, :, i])
             .......
            loss = cls_loss_ori + cls_loss_align + reg_loss
            optimizer.zero_grad()
            loss.backward()

I think the reason this code errors out is that, in the loop, I keep updating the value of the variable trans_snn. However, I have no idea how to solve it by renaming trans_snn. Looking forward to your help. Thank you very much!

If I move trans_snn = [c1_m, c1_s, c2_m, c2_s, c3_m, c3_s] into the loop, the error does not happen. However, I need the updated trans_snn.

I don't know how you are updating trans_snn, but assuming you are assigning intermediate or the output tensors from the model, they would most likely still be attached to the computation graph and you would thus "attach the computation graph" to trans_snn.
If that's the desired use case, you would have to use backward(retain_graph=True), but in most of the cases this is not the desired behavior and you might want to consider detaching the tensors which are assigned to trans_snn.
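
For the posted loop that could look roughly like this (a sketch; it assumes the optimizer step also happens inside the inner loop):

for i in range(search.shape[-1]):
    cls_loss_ori, cls_loss_align, reg_loss, trans_snn = model(
        template.squeeze(-1), search[:, :, :, :, i], trans_snn,
        label_cls[:, :, :, i],
        reg_target=reg_label[:, :, :, :, i], reg_weight=reg_weight[:, :, :, i])

    loss = cls_loss_ori + cls_loss_align + reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # keep the updated values, but drop their graph before the next iteration
    trans_snn = [t.detach() for t in trans_snn]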

Dear Ptrblck,

@ptrblck I have a question. I have successfully trained a GCN model, and now I want to re-train this GCN model with some constraints. But I get the same error when performing loss.backward(), i.e. "RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time." Could you please help me solve it?

Thanks so much for your kind help.

Best,
cm

This error message is generally raised if the intermediate activations were already freed by a previous backward() operation while you are trying to calculate the gradients a second time.
This could happen if you are directly calling backward multiple times without specifying retain_graph=True:

model = nn.Linear(1, 1)
x = torch.randn(1, 1)
out = model(x)

out.backward()
out.backward()
> RuntimeError: Trying to backward through the graph a second time 

or if the computation graph is still attached to a previous forward pass (as is often the case in RNNs when the states are not detached).
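
A small example of the RNN case and the usual fix (detaching the state between iterations):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=8)
h = torch.zeros(1, 1, 8)

for step in range(2):
    x = torch.randn(5, 1, 1)
    out, h = rnn(x, h)
    out.mean().backward()  # fails in the second iteration if h is not detached
    h = h.detach()         # detaching keeps the graphs of the iterations separate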

I used out.backward(retain_graph=True) for the first training, but I get this error when performing the second training. Could you please give me some advice on it?

Thanks

I would not recommend simply using retain_graph=True as a workaround without checking if it's really the desired behavior (in the majority of the use cases I've seen so far it was not the wanted behavior and was just used as a workaround).

Thanks for your answer. I carefully checked my code and added detach() to the variable from the previous graph; now I can successfully run my code.

Best,
cm

Thanks for your explanation.

Do you have any idea how to check which path is not covered the second time?

God bless you sir.

I wish I could buy you a cup of coffee

I have been battling with this for some time now.

This just ended my two-week struggle.

Thank you so much

Hi, how did you solve this, please?
I think I have a similar use case and am experiencing the same backprop error.

What are the ops that do not require buffers, please?

I'm afraid we don't have a list of these.
It will depend on the exact formula for each op, I'm afraid.
There are some places in the code where you could read about them, but you can also use tools like torchviz to plot what is saved by using show_saved=True.
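
For example, something like this (assuming a recent torchviz that supports show_saved):

import torch
import torch.nn as nn
from torchviz import make_dot

model = nn.Linear(1, 1)
x = torch.randn(1, 1)
out = model(x)

# show_saved=True also renders the tensors each op saved for the backward pass
make_dot(out, params=dict(model.named_parameters()), show_saved=True).render("graph", format="png")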

Hi @albanD, I am trying to do a similar thing where I have a reconstruction loss and a kernel alignment loss. They are calculated as below:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.We1 = torch.nn.Parameter(torch.Tensor(input_length, args.hidden_size).uniform_(-1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
        self.We2 = torch.nn.Parameter(torch.Tensor(args.hidden_size, args.code_size).uniform_(-1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))

        self.be1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        self.be2 = torch.nn.Parameter(torch.zeros([args.code_size]))

    def encoder(self, encoder_inputs):
        hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        return code

    def decoder(self, encoder_inputs):
        code = self.encoder(encoder_inputs)

        # ----- DECODER -----
        if tied_weights:
            # tie the decoder weights to the transposed encoder weights
            Wd1 = torch.t(self.We2)
            Wd2 = torch.t(self.We1)

        else:

            Wd1 = torch.nn.Parameter(
                torch.Tensor(args.code_size, args.hidden_size).uniform_(-1.0 / math.sqrt(args.code_size),
                                                                        1.0 / math.sqrt(args.code_size)))
            Wd2 = torch.nn.Parameter(
                torch.Tensor(args.hidden_size, input_length).uniform_(-1.0 / math.sqrt(args.hidden_size),
                                                                      1.0 / math.sqrt(args.hidden_size)))

        bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        bd2 = torch.nn.Parameter(torch.zeros([input_length]))

        if lin_dec:
            hidden_2 = torch.matmul(code, Wd1) + bd1
        else:
            hidden_2 = torch.tanh(torch.matmul(code, Wd1) + bd1)

        dec_out = torch.matmul(hidden_2, Wd2) + bd2

        return dec_out

    def kernel_loss(self,code, prior_K):
        # kernel on codes
        code_K = torch.mm(code, torch.t(code))

        # ----- LOSS -----
        # kernel alignment loss with normalized Frobenius norm
        code_K_norm = code_K / torch.linalg.matrix_norm(code_K, ord='fro', dim=(- 2, - 1))
        prior_K_norm = prior_K / torch.linalg.matrix_norm(prior_K, ord='fro', dim=(- 2, - 1))
        k_loss = torch.linalg.matrix_norm(torch.sub(code_K_norm,prior_K_norm), ord='fro', dim=(- 2, - 1))
        return k_loss

# Initialize model
model = Model()

Now, during training I pass my training data as inputs to the encoder and decoder.

for ep in range(args.num_epochs):
    for batch in range(max_batches):
        # get input data

        dec_out = model.decoder(encoder_inputs)
        reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
        enc_out = model.encoder(encoder_inputs)
        k_loss = model.kernel_loss(enc_out, prior_K)

        tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss
        tot_loss = tot_loss.float()

        # Backpropagation
        optimizer.zero_grad()
        #tot_loss.backward(retain_graph=True)
        tot_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)
        optimizer.step()

This always gives me an error saying "RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time".

It works only when I activate the retain_graph flag, but then training takes a huge amount of time. Can you please let me know what I am doing wrong here?

Thank you!

Hi,

I can't say for sure given the code you shared, but this is most likely due to some part of the computation being re-used from one iteration to the next: either something was computed before entering the loop in a differentiable way and is being re-used, or something is passed from one iteration to the next in a differentiable way.

If looking at the code doesn't work, one way to debug these is to use a visualization tool like pytorchviz (GitHub - szagoruyko/pytorchviz: A small package to create visualizations of PyTorch execution graphs).
You can get tot_loss from the first and second iterations and print them at the same time with make_dot. If any part of the graph is shared between the two, that means you have some shared computation that should not be there. Another thing that can happen is that the graph for the second one almost completely depends on the graph from the first one. That would indicate that the second iteration depends on the first in a bad way.
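
Roughly like this (compute_loss and loader stand in for your own code; make_dot accepts a tuple of outputs):

from torchviz import make_dot

losses = []
for i, batch in enumerate(loader):          # placeholder for your data loading
    losses.append(compute_loss(batch))      # placeholder for the code that builds tot_loss
    if i == 1:
        break

# drawing both losses in one plot shows whether their graphs share any nodes
make_dot(tuple(losses)).render("two_iterations", format="png")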

Hi @albanD, thank you so much for your time. So, before entering the loop, I am just declaring some variables to store the training progress and a regularization term:

reg_loss = 0
parameters = torch.nn.utils.parameters_to_vector(model.parameters())
for tf_var in parameters:
    reg_loss += torch.mean(torch.linalg.norm(tf_var))

# initialize training variables
time_tr_start = time.time()
batch_size = args.batch_size
max_batches = train_data.shape[0] // batch_size
loss_track = []
kloss_track = []

for ep in range(args.num_epochs):

    # shuffle training data
    idx = np.random.permutation(train_data.shape[0])
    train_data_s = train_data[idx, :]
    K_tr_s = K_tr[idx, :][:, idx]

    for batch in range(max_batches):
        fdtr = {}
        fdtr["encoder_inputs"] = train_data_s[(batch) * batch_size:(batch + 1) * batch_size, :]
        fdtr["prior_K"] = K_tr_s[(batch) * batch_size:(batch + 1) * batch_size,
                                 (batch) * batch_size:(batch + 1) * batch_size]

        encoder_inputs = (fdtr["encoder_inputs"].astype(float))
        encoder_inputs = torch.from_numpy(encoder_inputs)

        prior_K = (fdtr["prior_K"].astype(float))
        prior_K = torch.from_numpy(prior_K)

        dec_out = model.decoder(encoder_inputs)

        reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
        reconstruct_loss = reconstruct_loss.float()

        enc_out = model.encoder(encoder_inputs)
        k_loss = model.kernel_loss(enc_out, prior_K)
        k_loss = k_loss.float()

        tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss
        tot_loss = tot_loss.float()

        # Backpropagation
        optimizer.zero_grad()
        tot_loss.backward(retain_graph=True)
        #tot_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)
        optimizer.step()

Do you mean to say that I must define the reg_loss inside the training loop?
Thank you so much!

Oh yes, that reg_loss should be in the loop for sure!
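
Something along these lines (a sketch that keeps your names; here I iterate over model.parameters() directly instead of the flattened vector):

for batch in range(max_batches):
    # ... build encoder_inputs and prior_K as before ...

    # recompute the regularizer from the current parameters inside the loop,
    # so every backward() sees a freshly built graph for it
    reg_loss = 0
    for p in model.parameters():
        reg_loss = reg_loss + torch.mean(torch.linalg.norm(p))

    dec_out = model.decoder(encoder_inputs)
    reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
    k_loss = model.kernel_loss(model.encoder(encoder_inputs), prior_K)

    tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss

    optimizer.zero_grad()
    tot_loss.backward()          # no retain_graph needed anymore
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)
    optimizer.step()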

Thank you so much!! It worked; I had been spending days trying to find out what exactly was being reused :frowning: . I just need one more small thing to be verified by you. I converted a TF code to PyTorch. However, when I print the trainable model params, the PyTorch code gives exactly half of what the TF code prints. Following are the codes:

sess = tf.Session()

# placeholders
encoder_inputs = tf.placeholder(shape=(None, input_length), dtype=tf.float32, name='encoder_inputs')
prior_K = tf.placeholder(shape=(None, None), dtype=tf.float32, name='prior_K')

# ----- ENCODER -----
We1 = tf.Variable(
    tf.random_uniform((input_length, args.hidden_size), -1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
We2 = tf.Variable(tf.random_uniform((args.hidden_size, args.code_size), -1.0 / math.sqrt(args.hidden_size),
                                    1.0 / math.sqrt(args.hidden_size)))

be1 = tf.Variable(tf.zeros([args.hidden_size]))
be2 = tf.Variable(tf.zeros([args.code_size]))

hidden_1 = tf.nn.tanh(tf.matmul(encoder_inputs, We1) + be1)
code = tf.nn.tanh(tf.matmul(hidden_1, We2) + be2)

# kernel on codes
code_K = tf.tensordot(code, tf.transpose(code), axes=1)


# ----- DECODER -----
if tied_weights:
    Wd1 = tf.transpose(We2)
    Wd2 = tf.transpose(We1)
else:
    Wd1 = tf.Variable(tf.random_uniform((args.code_size, args.hidden_size), -1.0 / math.sqrt(args.code_size),
                                        1.0 / math.sqrt(args.code_size)))
    Wd2 = tf.Variable(tf.random_uniform((args.hidden_size, input_length), -1.0 / math.sqrt(args.hidden_size),
                                        1.0 / math.sqrt(args.hidden_size)))

bd1 = tf.Variable(tf.zeros([args.hidden_size]))
bd2 = tf.Variable(tf.zeros([input_length]))

if lin_dec:
    hidden_2 = tf.matmul(code, Wd1) + bd1
else:
    hidden_2 = tf.nn.tanh(tf.matmul(code, Wd1) + bd1)

dec_out = tf.matmul(hidden_2, Wd2) + bd2

# ----- LOSS -----
# kernel alignment loss with normalized Frobenius norm
code_K_norm = code_K / tf.norm(code_K, ord='fro', axis=[-2, -1])
prior_K_norm = prior_K / tf.norm(prior_K, ord='fro', axis=[-2, -1])
k_loss = tf.norm(code_K_norm - prior_K_norm, ord='fro', axis=[-2,-1])

And my converted PyTorch code is:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.We1 = torch.nn.Parameter(torch.Tensor(input_length, args.hidden_size).uniform_(-1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
        self.We2 = torch.nn.Parameter(torch.Tensor(args.hidden_size, args.code_size).uniform_(-1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))

        self.be1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        self.be2 = torch.nn.Parameter(torch.zeros([args.code_size]))

    def encoder(self, encoder_inputs):
        hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        return code

    def decoder(self, encoder_inputs):
        # hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        # code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        code = self.encoder(encoder_inputs)

        # ----- DECODER -----
        if tied_weights:
            # tie the decoder weights to the transposed encoder weights
            Wd1 = torch.t(self.We2)
            Wd2 = torch.t(self.We1)

        else:

            Wd1 = torch.nn.Parameter(
                torch.Tensor(args.code_size, args.hidden_size).uniform_(-1.0 / math.sqrt(args.code_size),
                                                                        1.0 / math.sqrt(args.code_size)))
            Wd2 = torch.nn.Parameter(
                torch.Tensor(args.hidden_size, input_length).uniform_(-1.0 / math.sqrt(args.hidden_size),
                                                                      1.0 / math.sqrt(args.hidden_size)))

        bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        bd2 = torch.nn.Parameter(torch.zeros([input_length]))

        if lin_dec:
            hidden_2 = torch.matmul(code, Wd1) + bd1
        else:
            hidden_2 = torch.tanh(torch.matmul(code, Wd1) + bd1)

        dec_out = torch.matmul(hidden_2, Wd2) + bd2

        return dec_out

    def kernel_loss(self,code, prior_K):
        # kernel on codes
        code_K = torch.mm(code, torch.t(code))

        # ----- LOSS -----
        # kernel alignment loss with normalized Frobenius norm
        code_K_norm = code_K / torch.linalg.matrix_norm(code_K, ord='fro', dim=(- 2, - 1))
        prior_K_norm = prior_K / torch.linalg.matrix_norm(prior_K, ord='fro', dim=(- 2, - 1))
        k_loss = torch.linalg.matrix_norm(torch.sub(code_K_norm,prior_K_norm), ord='fro', dim=(- 2, - 1))
        return k_loss

# Initialize model
model = Model()

Do you see anything seriously wrong here? I get exactly half the trainable params, and I guess this is affecting the gradients during backprop as well, since I am not getting similar results.

Thanks a lot! Regards