Multiple Loss Functions in a Model

Hello everyone,

I am trying to train a model built from three different modules: an encoder, a decoder, and a discriminator, connected as shown in the code below. Each of the three should minimize its own loss function, which is different from the others'.

Here is my code snippet:

with torch.autograd.set_detect_anomaly(True):
    for epoch in range(num_epochs):
        for i, (data, labels) in enumerate(train_loader):

            data = Variable(data.cuda())
            labels = Variable(labels.cuda())

            # initial hidden states for the encoder and the discriminator
            Encoder_hiddens = Variable((torch.zeros(2 * num_encoder_layers, batch_size, hidden_size)).cuda())
            Disc_hiddens = Variable((torch.zeros(2 * num_disc_layers, batch_size, hidden_size)).cuda())

            # mask part of the input (note: masked_data aliases data, so data is masked in place too)
            masked_data = data
            masked_data[:, 1:, mask_part] = 0

            # encoder: forward step
            optimizer_encoder.zero_grad()

            outputs_end, hidden_enc = model_encoder(data, Encoder_hiddens)
            outputs_dec = model_decoder(masked_data, hidden_enc)

            loss_ele = pixelwise_criterion(outputs_dec, data)
            # backward step
            loss_ele.backward()
            # optimization step
            optimizer_encoder.step()

            # decoder update
            optimizer_decoder.zero_grad()
            disc_real = model_disc(data, Disc_hiddens).view(-1)
            lossD_real = adversarial_criterion(disc_real, torch.ones_like(disc_real))
            disc_fake = model_disc(outputs_dec, Disc_hiddens).view(-1)
            lossD_fake = adversarial_criterion(disc_fake, torch.zeros_like(disc_fake))
            lossD = (lossD_real + lossD_fake) / 2
            loss_decoder = adv_ratio * lossD + (1 - adv_ratio) * loss_ele
            loss_decoder.backward()
            optimizer_decoder.step()

            # discriminator update
            optimizer_disc.zero_grad()
            lossD.backward()
            optimizer_disc.step()

I am getting the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

When I use retain_graph=True (loss_ele.backward(retain_graph=True)), I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [450, 45]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later.

Does anybody know how it can be fixed?

Thanks,

This is because backward() frees the saved intermediate results of the graph once it has run. Since you pass outputs_dec into the discriminator after the loss has already been computed for the encoder, the encoder/decoder graph and the discriminator graph are combined. To stop this you can detach the decoder output before feeding it to the discriminator:

    disc_fake = model_disc(outputs_dec.detach(), Disc_hiddens).view(-1)
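
For illustration, here is a minimal sketch of what the detach does, using toy tensors as hypothetical stand-ins for your modules (not your actual code): the backward pass on the discriminator side still runs, but no gradient reaches the "decoder" parameter anymore.

    import torch

    dec_weight = torch.randn(3, 3, requires_grad=True)    # stands in for a decoder parameter
    disc_weight = torch.randn(3, 1, requires_grad=True)   # stands in for a discriminator parameter
    x = torch.randn(1, 3)

    outputs_dec = x @ dec_weight              # "decoder" output
    disc_in = outputs_dec.detach()            # the graph is cut here
    disc_out = (disc_in @ disc_weight).sum()  # "discriminator" output

    disc_out.backward()
    print(disc_weight.grad)   # populated
    print(dec_weight.grad)    # None: nothing flowed back into the "decoder"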

Thanks for the reply. I tried disc_fake = model_disc(outputs_dec.detach(), Disc_hiddens).view(-1) but I am still getting the same error:

File “…/Temp.py”, line 167, in …
loss_decoder.backward(retain_graph=True)
File “…/lib/python3.6/site-packages/torch/tensor.py”, line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File “…/lib/python3.6/site-packages/torch/autograd/__init__.py”, line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [450, 45]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later.

You may have to tell PyTorch to retain the computation graph for future use, i.e., for the gradient calculation of the decoder:

        loss_ele.backward(retain_graph=True)

You may need retain_graph=True for loss_decoder.backward() as well, but not for lossD.backward(), so that the memory is released at the end.

Can you try with this?
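
For reference, here is a minimal, self-contained sketch (toy tensors, not your model) of how retain_graph behaves when two losses share the same graph: every backward except the last one needs retain_graph=True.

    import torch

    w = torch.randn(2, 2, requires_grad=True)
    x = torch.randn(1, 2)

    hidden = x @ w                  # shared intermediate result
    loss_a = hidden.sum()
    loss_b = (hidden ** 2).sum()

    loss_a.backward(retain_graph=True)   # keep the graph alive for the next backward
    loss_b.backward()                    # last backward through this graph, so it can be freed
    print(w.grad)                        # gradients from both losses are accumulated here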

Thanks for the reply. Yeah, I am using retain_graph=True with loss_decoder as well; otherwise I get the first error (“Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time”). With retain_graph=True for those two, I get the second one (“one of the variables needed for gradient computation has been modified by an inplace operation”).

With the detach you shouldn’t need retain_graph=True anymore.

Yeah, I am not using it when I use detach but I am getting the same error message.

Huh. Are you sure? Because the error message you sent says

File “…/Temp.py”, line 167, in …
loss_decoder.backward(retain_graph=True)

Yeah, I’ve tried both and it doesn’t work. Seems something else is causing the issue.

This looks tricky to achieve with multiple optimizers.

I am not sure if this is the reason, but you are calling optimizer.step() several times in between, and the weights of the encoder and decoder are modified in place by those calls. What if you try the weight updates in reverse order (discriminator, then decoder, then encoder):

# calculate encoder loss, decoder loss, discriminator loss

# discriminator update
optimizer_disc.zero_grad()
lossD.backward(retain_graph=True)
optimizer_disc.step()

# decoder update
optimizer_decoder.zero_grad()
loss_decoder.backward(retain_graph=True)
optimizer_decoder.step()

# encoder update
optimizer_encoder.zero_grad()
loss_ele.backward(retain_graph=True)
optimizer_encoder.step()

# to release the computation graph
lossD.backward()
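
As a side note, the in-place failure I am describing can be reproduced with a toy two-layer model (an illustrative sketch only, not your code): once optimizer.step() has updated a parameter in place, a later backward that still needs that parameter's saved value raises the version-mismatch error.

    import torch

    net = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Linear(2, 2))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    x = torch.randn(1, 2)

    out = net(x)
    loss_a = out.sum()
    loss_b = (out ** 2).sum()

    loss_a.backward(retain_graph=True)
    opt.step()           # updates the weights in place (bumps their version counters)
    loss_b.backward()    # RuntimeError: ... has been modified by an inplace operation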

When you use .detach(), you are disconnecting the decoder from the discriminator. I am not sure if that’s what you want to do.

Yeah, you’re right. I want them to be connected to each other.
BTW, I tried what you mentioned and I am getting a similar error message:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [60000, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Here is my discriminator class, in case it is the cause of the issue:

class Discriminator(nn.Module):
    def __init__(self, cfg):
        super(Discriminator, self).__init__()

        self.seq_max = cfg.Dataset.Seq_Max
        self.num_joints = cfg.Dataset.Num_joint

        if cfg.Dataset.Num_dimension == 3:
            self.input_size = self.num_joints * 3
        else:
            self.input_size = self.num_joints * 2

        self.hidden_size = cfg.TRAIN.Hidden_size
        self.batch_size = cfg.TRAIN.Batch_Size
        self.layers = cfg.TRAIN.Num_Disc_Layers
        self.dnn_layers = cfg.TRAIN.Encoder_Dnn_Layers
        self.dropout = cfg.TRAIN.Encoder_Dropout
        self.bi = cfg.TRAIN.Bidirectional_Disc

        if self.dnn_layers > 0:
            for i in range(self.dnn_layers):
                self.add_module('dnn_' + str(i), nn.Linear(
                    in_features=self.input_size if i == 0 else self.hidden_size,
                    out_features=self.hidden_size
                ))
        gru_input_dim = self.input_size if self.dnn_layers == 0 else self.hidden_size
        self.rnn = nn.GRU(
            gru_input_dim,
            self.hidden_size,
            self.layers,
            dropout=self.dropout,
            bidirectional=self.bi,
            batch_first=True)
        self.relu = nn.LeakyReLU(0.2, inplace=False)
        self.fc = nn.Linear(self.hidden_size * self.seq_max * self.layers, 1)
        self.nonlinear = nn.Tanh()

    def run_dnn(self, x):
        for i in range(self.dnn_layers):
            x = F.relu(getattr(self, 'dnn_' + str(i))(x))
        return x

    def forward(self, inputs, hidden):
        output, embedding = self.rnn(inputs, hidden)
        output = self.relu(output).contiguous()
        output = output.view(output.size()[0], -1)
        output = self.fc(output)
        output = self.nonlinear(output)

        return output

The decoder and the encoder have similar architectures.

I see. Even in this case, the final lossD.backward() runs into the same modified-by-an-inplace-operation scenario.

From @albanD’s answer here:

You can use del lossD instead of the final lossD.backward() (to release the computation graph). Can you try that?

Edit: Can you pack the encoder and decoder into one optimizer, or backward them together, if possible? The encoder's gradient calculation depends on the decoder's parameters as well, so you can't call optimizer_decoder.step() before loss_ele.backward(). One solution is as follows:

# calculate encoder loss, decoder loss, discriminator loss

# discriminator update
optimizer_disc.zero_grad()
lossD.backward(retain_graph=True)
optimizer_disc.step()

# encoder and decoder update
optimizer_encoder.zero_grad()
optimizer_decoder.zero_grad()

loss_generator = loss_ele + loss_decoder  # loss_ele is the encoder's loss
loss_generator.backward()

optimizer_decoder.step()
optimizer_encoder.step()

# to release the computation graph of the discriminator
del lossD
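
If the single-optimizer route is easier for you, a common pattern is to hand both parameter sets to one optimizer, so that a single backward() and step() update encoder and decoder together. A rough sketch (toy nn.Linear stand-ins for the real modules; the optimizer and learning rate are just placeholders):

    import itertools
    import torch
    from torch import nn

    # toy stand-ins for model_encoder / model_decoder
    model_encoder = nn.Linear(4, 4)
    model_decoder = nn.Linear(4, 4)

    # one optimizer over both parameter sets
    optimizer_gen = torch.optim.SGD(
        itertools.chain(model_encoder.parameters(), model_decoder.parameters()),
        lr=0.01,
    )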

Thanks a lot for your responses.

This configuration seems to solve the problem:

        optimizer_encoder.zero_grad()
        optimizer_decoder.zero_grad()

        loss_ele.backward(retain_graph=True)
        loss_decoder.backward(retain_graph=True)

        optimizer_encoder.step()
        optimizer_decoder.step()

        # discriminator update
        optimizer_disc.zero_grad()
        lossD.backward()
        optimizer_disc.step()

BTW, should I still do “del lossD” after all optimization steps? What does it do?

del lossD is not really needed if you are not storing any references to it after the for loop.
I put it there to explicitly make sure it is deleted.
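
A toy illustration of that point (hypothetical tensors, not your code): rebinding the name each iteration already drops the previous loss, so the explicit del only frees it a little earlier and makes the intent obvious.

    import torch

    w = torch.randn(2, requires_grad=True)
    for step in range(3):
        lossD = (w * step).sum()   # rebinding drops the previous iteration's loss (and what is left of its graph)
        lossD.backward()
        w.grad = None              # reset the gradient for the next iteration
    del lossD                      # optional: only matters if the name would otherwise stay alive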