Backpropagating multiple losses

I am training model 1 (using train1) with a specific loss function that involves tensor A. I accumulate the loss and then want to perform an update. Next, I train a second model, model 2 (using train2), where I want to calculate the gradients w.r.t. A using the loss calculated in train2. That is why I am adding loss1 to loss2.

#reproduce error
from transformers import BertModel, BertForMaskedLM, BertConfig, EncoderDecoderModel
import torch
import torch.nn.functional as F
model1 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
model2 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints

optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.001)
A=torch.rand(1, requires_grad=True)
optimizer3 = torch.optim.SGD([A], lr=0.1)

en_input=torch.tensor([[1,2], [3,4]])
en_masks=torch.tensor([[0,0], [0,0]])
de_output=torch.tensor([[3,1], [4,2]])
de_masks=torch.tensor([[0,0], [0,0]])
lm_labels=torch.tensor([[5,7], [6,8]])


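def train1():
  acc = torch.zeros(0)  # assumed: start from an empty accumulator (initialization not shown in the post)
  for i in range(2):
    out = model1(input_ids=en_input, attention_mask=en_masks, decoder_input_ids=de_output,
                 decoder_attention_mask=de_masks, labels=lm_labels.clone())

    prediction_scores = out[1]
    predictions = F.log_softmax(prediction_scores, dim=2)
    p = ((predictions.sum() - de_output.sum()) * A).sum()
    p = torch.unsqueeze(p, dim=0)
    acc = torch.cat((p, acc))  # accumulating the loss

  loss = acc.sum()  # assumed: the accumulated terms are summed into a single loss
  return loss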

def train2(loss1):
  for i in range(2):
    output = model2(input_ids=en_input, attention_mask=en_masks,
                    decoder_attention_mask=de_masks, labels=lm_labels.clone())
    prediction_scores_ = output[1]
    predictions_ = F.log_softmax(prediction_scores_, dim=2)
    # want to calculate gradients wrt A
    loss2 = ((predictions_.sum() - de_output.sum())).sum() + loss1
    loss2.backward(inputs=[A], retain_graph=True)
    optimizer3.step()  # update A based on calculated gradients


If this is the right method, I don't understand what is wrong in my code. If it is not right, I would appreciate it if someone pointed me in the right direction.

error trace

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]] is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Based on the description it seems you are trying to use stale intermediate activations to calculate the gradients for already updated parameters, which would raise this error.
This post explains the issue in more detail using a GAN training approach.
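
The failure mode can be reproduced without the BERT models. The following is a minimal sketch, not the original training code: `x` is a hypothetical stand-in for a model output, and both helper functions are made up for illustration. Multiplying by `A` makes autograd save `A` for the backward pass; `optimizer.step()` then changes `A` in place, so a second backward through the old graph fails, while recomputing the loss after each step works.

```python
import torch


def stale_graph_backward():
    """Backward twice through the same graph, updating A in between."""
    A = torch.rand(1, requires_grad=True)
    x = torch.rand(1, requires_grad=True)  # hypothetical stand-in for a model output
    opt = torch.optim.SGD([A], lr=0.1)

    loss1 = (x * A).sum()              # autograd saves A to compute d(loss1)/dx
    loss1.backward(retain_graph=True)
    opt.step()                         # in-place update of A bumps its version counter
    try:
        loss1.backward()               # the old graph still expects the old A
    except RuntimeError as err:
        return str(err)
    return None


def fresh_graph_backward():
    """Recompute the loss after every update so no stale graph is reused."""
    A = torch.rand(1, requires_grad=True)
    x = torch.rand(1, requires_grad=True)
    opt = torch.optim.SGD([A], lr=0.1)

    for _ in range(2):
        opt.zero_grad()
        loss1 = (x * A).sum()          # new forward pass with the current A
        loss1.backward()
        opt.step()
    return A.detach().clone()


if __name__ == "__main__":
    print(stale_graph_backward())
    print(fresh_graph_backward())
```

In the code from the question, `loss1` plays the role of the stale graph: it was built in train1 with the original `A`, and train2 backpropagates through it again after `optimizer3.step()` has already modified `A`.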

Hey @ptrblck,
Thank you so much for replying. I understand my mistake now. I have another question: is it possible to calculate the gradients w.r.t. A if I don't add loss1 to loss2 and simply call loss2.backward(inputs=[A])? Thanks.

That might be possible, as it seems A is used in the loss calculation and might not be using the aforementioned stale activations. In any case, you could just run the code and see whether Autograd raises an error.

Well, I did try running the code without it. There is no change in the gradients of A.
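
A likely explanation, assuming loss2 is computed exactly as in train2 but without the `+ loss1` term: A never enters that computation, so there is no path from loss2 back to A and no gradient can reach it. The sketch below uses a hypothetical parameter `w` in place of model2 and calls `torch.autograd.grad` with `allow_unused=True` to make this visible. A only receives a gradient once the loss actually depends on it.

```python
import torch

A = torch.rand(1, requires_grad=True)
w = torch.rand(3, requires_grad=True)  # hypothetical stand-in for model2's parameters

# A loss that does not involve A: autograd finds no path to A.
loss_without_A = (w * 2.0).sum()
(grad_unused,) = torch.autograd.grad(loss_without_A, [A], allow_unused=True)

# A loss with an extra term that depends on A (analogous to adding loss1).
loss_with_A = (w * 2.0).sum() + (w.sum() * A).sum()
(grad_used,) = torch.autograd.grad(loss_with_A, [A])

print(grad_unused)  # None, since A is not part of the graph
print(grad_used)    # equals w.sum(), the derivative of the extra term w.r.t. A
```

So to get a gradient for A from the second stage, A has to take part in a freshly computed forward pass there, rather than being attached through a loss whose graph was built before A was updated.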