GRU training problem with retain_graph

I understand that when calling loss.backward() we need to specify retain_graph=True if there are multiple networks and multiple loss functions to optimize each network separately. But I get errors whether or not I specify this parameter. Following is an MWE to reproduce the issue (on PyTorch 1.6).

import torch
from torch import nn
from torch import optim
torch.autograd.set_detect_anomaly(True)


class GRU1(nn.Module):
    def __init__(self):
        super(GRU1, self).__init__()
        self.brnn = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)

    def forward(self, x):
        return self.brnn(x)


class GRU2(nn.Module):
    def __init__(self):
        super(GRU2, self).__init__()
        self.brnn = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)

    def forward(self, x):
        return self.brnn(x)


for i in range(100):
    gru1 = GRU1()
    gru2 = GRU2()
    gru1_opt = optim.Adam(gru1.parameters())
    gru2_opt = optim.Adam(gru2.parameters())
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    criterion = nn.MSELoss()
    vector = torch.randn((15, 100, 2))
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
    loss_gru1.backward(retain_graph=True)
    gru1_opt.step()
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward(retain_graph=True)
    gru2_opt.step()
    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")

With retain_graph set to True I get the error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 300]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The error without the parameter is

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

which is expected.

Please point at what needs to be changed in the above code for it to begin training. Any help would be appreciated.

Hi,

The error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 300]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Comes from the fact that when you run loss_gru2.backward() (note that retain_graph is not needed here), you actually backprop all the way to gru1. But gru1's parameters were modified in place by gru1_opt.step(). Hence the error you’re seeing.
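To see why the step() in between breaks the second backward, here is a small standalone repro with plain Linear layers rather than the GRUs from the question: backprop through a layer needs the weights that were saved during the forward pass, and an optimizer step modifies those weights in place, bumping their version counter.

```python
import torch
from torch import nn, optim

# Two layers: the backward pass for lin1's parameters has to flow
# through lin2, and that computation uses lin2.weight as saved at forward time.
lin1 = nn.Linear(2, 4)
lin2 = nn.Linear(4, 1)
opt2 = optim.SGD(lin2.parameters(), lr=0.1)

x = torch.randn(8, 2)
out = lin2(lin1(x))

loss1 = out.mean()
loss1.backward(retain_graph=True)
opt2.step()  # modifies lin2.weight in place -> its version counter advances

loss2 = out.pow(2).mean()
err = None
try:
    loss2.backward()  # needs the pre-step lin2.weight to reach lin1's parameters
except RuntimeError as e:
    err = e
print("reproduced:", err)
```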

If you want the second part to only consider gru2, then you can detach the output of gru1 to make the two parts independent:

    gru2_output, _ = gru2(gru1_output.detach())  # (15, 100, 2)

This will prevent the backprop from going all the way to gru1.

Note that it will also remove the need for retain_graph=True in the first backward of gru1, because now you don’t backprop through that graph again!
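Putting it together, a minimal sketch of the loop with the detach applied (bare nn.GRU modules for brevity, same shapes as the question; note that neither backward needs retain_graph any more):

```python
import torch
from torch import nn, optim

gru1 = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)
gru2 = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)
opt1 = optim.Adam(gru1.parameters())
opt2 = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()

for _ in range(2):  # shortened loop for illustration
    opt1.zero_grad()
    opt2.zero_grad()
    vector = torch.randn(15, 100, 2)
    out1, _ = gru1(vector)                            # (15, 100, 200)
    loss1 = criterion(out1, torch.randn(15, 100, 200))
    loss1.backward()                                  # no retain_graph needed
    opt1.step()
    out2, _ = gru2(out1.detach())                     # cut the graph here
    loss2 = criterion(out2, torch.randn(15, 100, 2))
    loss2.backward()                                  # only reaches gru2
    opt2.step()
```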

@albanD hi, thanks for the response. However the use case I am looking at is to train the first network GRU1 in the second part as well. How would that be achieved? I have added the updated code below:

import torch
from torch import nn
from torch import optim
torch.autograd.set_detect_anomaly(True)


class GRU1(nn.Module):
    def __init__(self):
        super(GRU1, self).__init__()
        self.brnn = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)

    def forward(self, x):
        return self.brnn(x)


class GRU2(nn.Module):
    def __init__(self):
        super(GRU2, self).__init__()
        self.brnn = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)

    def forward(self, x):
        return self.brnn(x)


gru1 = GRU1()
gru2 = GRU2()
gru1_opt = optim.Adam(gru1.parameters())
gru2_opt = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()

for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru1 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru1.backward()
    gru1_opt.step()
    gru1_output, _ = gru1(gru2_output)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward()
    gru2_opt.step()
    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")

What should one do in such a case? Thanks!

Hi,

So you want gru2_opt to use the gradients from both uses of gru2?
In that case, you will have to delay the gru1_opt.step(), I’m afraid. Or recompute the forward pass after the gru1 net has been updated, if that’s what you want.
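A minimal sketch of that delayed-step ordering, assuming the same shapes as the code above: run both forward passes and both backwards first, and only then call step() on either optimizer, so no parameter is modified in place while a graph that saved it is still needed.

```python
import torch
from torch import nn, optim

gru1 = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)
gru2 = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)
opt1 = optim.Adam(gru1.parameters())
opt2 = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()

opt1.zero_grad()
opt2.zero_grad()
vector = torch.randn(15, 100, 2)
out1, _ = gru1(vector)                              # (15, 100, 200)
out2, _ = gru2(out1)                                # (15, 100, 2)
loss1 = criterion(out2, torch.randn(15, 100, 2))
loss1.backward(retain_graph=True)                   # shared graph stays alive

out1b, _ = gru1(out2)                               # second chained forward
out2b, _ = gru2(out1b)                              # (15, 100, 2)
loss2 = criterion(out2b, torch.randn(15, 100, 2))
loss2.backward()

opt1.step()                                         # delayed until after both backwards
opt2.step()
```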

@albanD that is not what I was expecting, because gru1_opt.step() would help the network predict better in the next time instant following step(). Is there any workaround? Also, I got the following to work:

for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru1 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru1.backward(retain_graph=True)
    gru1_opt.step()
    gru1_output, _ = gru1(vector)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward(retain_graph=True)
    gru2_opt.step()
    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")

But I guess I will run into memory issues.

PS: In the first backprop I only need to update the weights of GRU1. In the second backprop, I need to update the weights of GRU2.

In essence, I want to flow the gradients through GRU2 and update the weights of GRU1 in the first instance. In the second part I need to update the weights of GRU2 only.

Yes, this code sample will work, because the second set of updates starts from vector, which is not linked to the first set of forward passes. That will work fine if it’s what you want.
You can actually remove the retain_graph=True there, as you don’t backprop through the same graph multiple times 🙂

But I guess I will run into memory issues.

I don’t think so, why would you?

@albanD because I would be retaining the graph every time, won’t I run into memory problems after some iterations?

Well, you don’t need to retain the graph in this case, I think.

But even if you do, retain_graph only prevents the graph from being freed during the backward pass. Once nothing can reference it anymore, it is freed: it does not leak memory, it just delays when the memory is released.
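As a small self-contained check of that behaviour: a second backward through the same graph only works if the first call passed retain_graph=True, and gradients simply accumulate; once the graph's last reference is gone, it is freed like any other object.

```python
import torch

x = torch.randn(5, requires_grad=True)
y = (x * 2).sum()
y.backward(retain_graph=True)   # graph kept alive, can backward again
y.backward()                    # OK; grads accumulate into x.grad

z = (x * 3).sum()
z.backward()                    # graph freed after this call (default)
err = None
try:
    z.backward()                # second backward through a freed graph
except RuntimeError as e:
    err = e
print("x.grad:", x.grad)        # 2 + 2 + 3 = 7 per element
print("second z.backward() error:", err)
```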